Formant

Formants are defined by Gunnar Fant[1] as 'the spectral
peaks of the sound spectrum |P(f)|' of the voice. In
speech science and phonetics, formant is also used to
mean an acoustic resonance of the human vocal tract.

It is often measured as an amplitude peak in the 
frequency spectrum of the sound, using a spectrogram 
(in the figure) or a spectrum analyzer, though in vowels 
spoken with a high fundamental frequency, as in a 
female or child voice, the frequency of the resonance 
may lie between the widely-spread harmonics and 
hence no peak is visible. In acoustics, it refers to a peak 
in the sound envelope and/or to a resonance in sound 
sources, notably musical instruments, as well as that of 
sound chambers. Thus it is possible to talk about the 
formant frequencies of a room, as exploited, for 
example, by Alvin Lucier in his piece I Am Sitting in a Room. 



Formants and phonetics 

Formants are the distinguishing or meaningful frequency components of human speech and of singing. By definition, 
the information that humans require to distinguish between vowels can be represented purely quantitatively by the 
frequency content of the vowel sounds. In speech, these are the characteristic partials that identify vowels to the 
listener. Most of these formants are produced by tube and chamber resonance, but a few whistle tones derive from 
periodic collapse of Venturi effect low-pressure zones. The formant with the lowest frequency is called f1, the second
f2, and the third f3. Most often the first two formants, f1 and f2, are enough to disambiguate the vowel. These two
formants determine the quality of vowels in terms of the open/close and front/back dimensions (which have
traditionally, though not entirely accurately, been associated with the position of the tongue). Thus the first formant
f1 has a higher frequency for an open vowel (such as [a]) and a lower frequency for a close vowel (such as [i] or [u]);
and the second formant f2 has a higher frequency for a front vowel (such as [i]) and a lower frequency for a back
vowel (such as [u]).[3] Vowels will almost always have four or more distinguishable formants; sometimes there are
more than six. However, the first two formants are most important in determining vowel quality, and this is often
displayed in terms of a plot of the first formant against the second formant, though this is not sufficient to capture
some aspects of vowel quality, such as rounding.

Nasals usually have an additional formant around 2500 Hz. The liquid [l] usually has an extra formant at 1500 Hz,
while the English "r" sound ([ɹ]) is distinguished by virtue of a very low third formant (well below 2000 Hz).

Plosives (and, to some degree, fricatives) modify the placement of formants in the surrounding vowels. Bilabial 
sounds (such as 'b' and 'p' as in "ball" or "sap") cause a lowering of the formants; velar sounds ('k' and 'g' in English) 
almost always show f2 and f3 coming together in a 'velar pinch' before the velar and separating from the same 'pinch'
as the velar is released; alveolar sounds (English 't' and 'd') cause less systematic changes in neighboring vowel
formants, depending partially on exactly which vowel is present. The time-course of these changes in vowel formant
frequencies is referred to as 'formant transitions'.

If the fundamental frequency of the underlying vibration is higher than a resonance frequency of the system, then the 
formant usually imparted by that resonance will be mostly lost. This is most apparent in the example of soprano 
opera singers, who sing high enough that their vowels become very hard to distinguish.

Control of resonances is an essential component of the vocal technique known as overtone singing, in which the
performer sings a low fundamental tone, and creates sharp resonances to select upper harmonics, giving the 
impression of several tones being sung at once. 
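
Both effects described above, the loss of a formant at a high fundamental frequency and the selection of upper
harmonics in overtone singing, depend on where the harmonics of the voice fall relative to a resonance. The following
is a minimal sketch of that arithmetic. The function names are hypothetical, and the example frequencies (a 320 Hz
resonance, a 100 Hz low voice, an 880 Hz soprano note) are illustrative assumptions rather than measurements.

```python
def harmonics(f0_hz, max_hz=5000.0):
    """Return the harmonic frequencies k * f0 that lie at or below max_hz."""
    count = int(max_hz // f0_hz)
    return [k * f0_hz for k in range(1, count + 1)]


def nearest_harmonic(f0_hz, resonance_hz):
    """Return (harmonic number, frequency, distance in Hz) of the harmonic closest to a resonance."""
    k = max(1, round(resonance_hz / f0_hz))  # the fundamental (k = 1) is the lowest harmonic
    freq = k * f0_hz
    return k, freq, abs(freq - resonance_hz)


# Low voice: the third harmonic lies close to the assumed 320 Hz resonance.
print(nearest_harmonic(100.0, 320.0))
# High soprano note: even the fundamental lies far above the resonance.
print(nearest_harmonic(880.0, 320.0))
```

When the closest harmonic is far from the resonance, little energy is available to excite it, which is one way to
picture why the corresponding formant is mostly lost.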

Spectrograms are used to visualise formants. 

Vowel formant centers

Vowel (IPA)    Formant f1    Formant f2
u              320 Hz        800 Hz
o              500 Hz        1000 Hz
ɑ              700 Hz        1150 Hz
a              1000 Hz       1400 Hz
ø              500 Hz        1500 Hz
y              320 Hz        1650 Hz
ɛ              700 Hz        1800 Hz
e              500 Hz        2300 Hz
i              320 Hz        2500 Hz
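
Because the first two formants are usually enough to disambiguate a vowel, a measured (f1, f2) pair can be compared
against tabulated centers. The sketch below uses a subset of the formant centers listed above and picks the closest
one. The use of plain Euclidean distance in Hz and the function name nearest_vowel are assumptions made for
illustration; perceptually motivated frequency scales would weight the two formants differently.

```python
import math

# (f1, f2) centers in Hz for a subset of the tabulated vowels.
VOWEL_CENTERS = {
    "u": (320, 800),
    "o": (500, 1000),
    "a": (1000, 1400),
    "y": (320, 1650),
    "e": (500, 2300),
    "i": (320, 2500),
}


def nearest_vowel(f1_hz, f2_hz):
    """Return the vowel whose (f1, f2) center is closest to the measured pair."""
    measured = (f1_hz, f2_hz)
    return min(VOWEL_CENTERS, key=lambda v: math.dist(measured, VOWEL_CENTERS[v]))


print(nearest_vowel(300, 2400))
print(nearest_vowel(950, 1350))
```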


Vowel formants

Vowel    Main formant region
u        200-400 Hz
o        400-600 Hz
a        800-1200 Hz
e        400-600 and 2200-2600 Hz
i        200-400 and 3000-3500 Hz


Singers' formant 

Studies of the frequency spectrum of trained singers, especially male singers, indicate a clear formant around
3000 Hz (between 2800 and 3400 Hz) that is absent in speech or in the spectra of untrained singers. It is thought to
be associated with one or more of the higher resonances of the vocal tract.[7] It is this increase in energy at 3000 Hz
which allows singers to be heard and understood over an orchestra, whose energy peaks at much lower frequencies of
around 500 Hz. This formant is actively developed through vocal training, for instance through so-called "voce di
strega" or "witch's voice"[8] exercises, and is caused by a part of the vocal tract acting as a resonator.[9]

External links

• Formants for fun and profit (http://homepage.ntu.edu.tw/~karchung/Phonetics II page nineteen.htm) 

• Formants and wah-wah pedals (http://www.geofex.com/Article_Folders/wahpedl/voicewah.htm) 

• What is a formant? (http://www.phys.unsw.edu.au/jw/formant.html) A discussion of the three different 
meanings of the word 'formant' 

• Formant tuning by soprano singers (http://www.phys.unsw.edu.au/jw/soprane.html) from the University of 
New South Wales 

• The acoustics of harmonic or overtone singing (http://www.phys.unsw.edu.au/jw/xoomi.html) from the 
University of New South Wales 

• Materials for measuring and plotting vowel formants (http://videoweb.nie.edu.sg/phonetic/vowels/measurements.html)

• Acoustics of the Vowel (http://www.vowel.ch/acoustics/index.htm) A discussion of possible formant
variations that do not affect phoneme identity.

Speech synthesis


Speech synthesis is the artificial production of human speech. A 
computer system used for this purpose is called a speech synthesizer, 
and can be implemented in software or hardware. A text-to-speech 
(TTS) system converts normal language text into speech; other 
systems render symbolic linguistic representations like phonetic 
transcriptions into speech.

Synthesized speech can be created by concatenating pieces of recorded speech that are stored in a database. Systems
differ in the size of the stored speech units; a system that stores phones or diphones provides the largest output
range, but may lack clarity. For specific usage domains, the storage of entire words or sentences allows for
high-quality output. Alternatively, a synthesizer can incorporate a model of the vocal tract and other human voice
characteristics to create a completely "synthetic" voice output.[2]



The quality of a speech synthesizer is judged by its similarity to the human voice and by its ability to be
understood. An intelligible text-to-speech program allows people with visual impairments or reading disabilities to
listen to written works on a home computer. Many computer operating systems have included speech synthesizers since
the early 1990s.

[Figure: Stephen Hawking is one of the most famous people using speech synthesis to communicate.]


Overview of text processing 

A text-to-speech system (or "engine") is composed of two parts: a front-end and a back-end. The front-end has two
major tasks. First, it converts raw text containing symbols like numbers and abbreviations into the equivalent of
written-out words. This process is often called text normalization, pre-processing, or tokenization. The front-end
then assigns phonetic transcriptions to each word, and divides and marks the text into prosodic units, like phrases,
clauses, and sentences. The process of assigning phonetic transcriptions to words is called text-to-phoneme or
grapheme-to-phoneme conversion. Phonetic transcriptions and prosody information together make up the symbolic
linguistic representation that is output by the front-end. The back-end, often referred to as the synthesizer, then
converts the symbolic linguistic representation into sound. In certain systems, this part includes the computation of
the target prosody (pitch contour, phoneme durations), which is then imposed on the output speech.

History

Long before electronic signal processing was invented, there were those who tried to build machines to create human 
speech. Some early legends of the existence of "speaking heads" involved Gerbert of Aurillac (d. 1003 AD), 
Albertus Magnus (1198-1280), and Roger Bacon (1214-1294). 

In 1779, the Danish scientist Christian Kratzenstein, working at the Russian Academy of Sciences, built models of
the human vocal tract that could produce the five long vowel sounds (in International Phonetic Alphabet notation,
they are [aː], [eː], [iː], [oː] and [uː]). This was followed by the bellows-operated "acoustic-mechanical speech
machine" by Wolfgang von Kempelen of Pressburg, Hungary, described in a 1791 paper.[6] This machine added
models of the tongue and lips, enabling it to produce consonants as well as vowels. In 1837, Charles Wheatstone
produced a "speaking machine" based on von Kempelen's design, and in 1857, M. Faber built the "Euphonia".
Wheatstone's design was resurrected in 1923 by Paget.[7]

In the 1930s, Bell Labs developed the vocoder, which automatically analyzed speech into its fundamental tone and 
resonances. From his work on the vocoder, Homer Dudley developed a manually keyboard-operated voice 
synthesizer called The Voder (Voice Demonstrator), which he exhibited at the 1939 New York World's Fair. 

The Pattern playback was built by Dr. Franklin S. Cooper and his colleagues at Haskins Laboratories in the late 
1940s and completed in 1950. There were several different versions of this hardware device but only one currently 
survives. The machine converts pictures of the acoustic patterns of speech in the form of a spectrogram back into 
sound. Using this device, Alvin Liberman and colleagues were able to discover acoustic cues for the perception of 
phonetic segments (consonants and vowels). 

Dominant systems in the 1980s and 1990s were the MITalk system, based largely on the work of Dennis Klatt at
MIT, and the Bell Labs system;[8] the latter was one of the first multilingual language-independent systems, making
extensive use of natural language processing methods.

Early electronic speech synthesizers sounded robotic and were often barely intelligible. The quality of synthesized 
speech has steadily improved, but output from contemporary speech synthesis systems is still clearly distinguishable 
from actual human speech. 

As improving cost-performance ratios make speech synthesizers cheaper and more accessible, more people will benefit
from the use of text-to-speech programs.

Electronic devices 

The first computer-based speech synthesis systems were created in the late 1950s. The first general English
text-to-speech system was developed by Noriko Umeda et al. in 1968 at the Electrotechnical Laboratory, Japan.[10] In
1961, physicist John Larry Kelly, Jr and colleague Louis Gerstman[11] used an IBM 704 computer to synthesize
speech, an event among the most prominent in the history of Bell Labs. Kelly's voice recorder synthesizer (vocoder)
recreated the song "Daisy Bell", with musical accompaniment from Max Mathews. Coincidentally, Arthur C. Clarke
was visiting his friend and colleague John Pierce at the Bell Labs Murray Hill facility. Clarke was so impressed by
the demonstration that he used it in the climactic scene of his screenplay for his novel 2001: A Space Odyssey
[Arthur C. Clarke Biography at the Wayback Machine (archived December 11, 1997)], where the HAL 9000 computer
sings the same song as it is being put to sleep by astronaut Dave Bowman ["Where "HAL" First Spoke (Bell Labs
Speech Synthesis website)". Bell Labs. Retrieved 2010-02-17]. Despite the success of purely electronic speech
synthesis, research is still being conducted into mechanical speech synthesizers [Anthropomorphic Talking Robot
Waseda-Talker Series].

Handheld electronics featuring speech synthesis began emerging in the 1970s. One of the first was the Telesensory
Systems Inc. (TSI) Speech+ portable calculator for the blind in 1976 [TSI Speech+ & other speaking calculators;
Gevaryahu, Jonathan, "TSI S14001A Speech Synthesizer LSI Integrated Circuit Guide"].[16] Other devices were produced
primarily for educational purposes, such as Speak & Spell, produced by Texas Instruments in 1978 [Breslow, et al.
United States Patent 4326710: "Talking electronic game", April 27, 1982].[17] Fidelity released a speaking version of
its electronic chess computer in 1979 [Voice Chess Challenger]. The first video game to feature speech synthesis was
the 1980 shoot 'em up arcade game Stratovox, from Sun Electronics [Gaming's Most Important Evolutions, GamesRadar].
Another early example was the arcade version of Berzerk, released that same year. The first multi-player electronic
game using voice synthesis was Milton from Milton Bradley Company, which produced the device in 1980.

Synthesizer technologies 

The most important qualities of a speech synthesis system are naturalness and intelligibility. Naturalness describes 
how closely the output sounds like human speech, while intelligibility is the ease with which the output is 
understood. The ideal speech synthesizer is both natural and intelligible. Speech synthesis systems usually try to 
maximize both characteristics. 

The two primary technologies for generating synthetic speech waveforms are concatenative synthesis and formant 
synthesis. Each technology has strengths and weaknesses, and the intended uses of a synthesis system will typically 
determine which approach is used. 

Concatenative synthesis 

Concatenative synthesis is based on the concatenation (or stringing together) of segments of recorded speech. 
Generally, concatenative synthesis produces the most natural-sounding synthesized speech. However, differences 
between natural variations in speech and the nature of the automated techniques for segmenting the waveforms 
sometimes result in audible glitches in the output. There are three main sub-types of concatenative synthesis. 

Unit selection synthesis 

Unit selection synthesis uses large databases of recorded speech. During database creation, each recorded utterance 
is segmented into some or all of the following: individual phones, diphones, half-phones, syllables, morphemes, 
words, phrases, and sentences. Typically, the division into segments is done using a specially modified speech 
recognizer set to a "forced alignment" mode with some manual correction afterward, using visual representations 
such as the waveform and spectrogram.[20] An index of the units in the speech database is then created based on the
segmentation and acoustic parameters like the fundamental frequency (pitch), duration, position in the syllable, and 
neighboring phones. At run time, the desired target utterance is created by determining the best chain of candidate 
units from the database (unit selection). This process is typically achieved using a specially weighted decision tree. 
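
The search for the best chain of candidate units can be illustrated with a small example. The passage above describes
a weighted decision tree; the sketch below swaps in a different, commonly described formulation, a dynamic-programming
(Viterbi-style) search that minimizes a target cost plus a join cost. Reducing each unit to a single pitch value, and
the names select_units and join_weight, are assumptions made purely for illustration.

```python
def select_units(targets, candidates, join_weight=1.0):
    """Pick one candidate per position, minimizing target cost plus join cost.

    targets:    desired pitch (Hz) at each position
    candidates: for each position, a list of candidate unit pitches (Hz)
    Returns (total cost, list of chosen candidate indices).
    """
    # best[i][j] = (cost of the cheapest chain ending at candidate j of position i, back-pointer)
    best = [[(abs(c - targets[0]), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for c in candidates[i]:
            target_cost = abs(c - targets[i])
            # Cheapest predecessor, counting the pitch jump at the concatenation point.
            chain_cost, prev_j = min(
                (best[i - 1][j][0] + join_weight * abs(c - p), j)
                for j, p in enumerate(candidates[i - 1])
            )
            row.append((chain_cost + target_cost, prev_j))
        best.append(row)

    # Trace the cheapest chain back from the last position.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(j)
        j = best[i][j][1]
    path.reverse()
    return total, path


total, chosen = select_units([100, 110, 120], [[100, 140], [90, 112], [121, 100]])
print(total, chosen)
```

A real system would score many more features (duration, position in the syllable, neighboring phones) and would
search a far larger candidate set.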

Unit selection provides the greatest naturalness, because it applies only a small amount of digital signal processing 
(DSP) to the recorded speech. DSP often makes recorded speech sound less natural, although some systems use a 
small amount of signal processing at the point of concatenation to smooth the waveform. The output from the best 
unit-selection systems is often indistinguishable from real human voices, especially in contexts for which the TTS 
system has been tuned. However, maximum naturalness typically requires unit-selection speech databases to be very
large, in some systems ranging into the gigabytes of recorded data, representing dozens of hours of speech. Also,
unit selection algorithms have been known to select segments from a place that results in less than ideal synthesis
(e.g. minor words become unclear) even when a better choice exists in the database. Recently, researchers have
proposed various automated methods to detect unnatural segments in unit-selection speech synthesis systems.[23]

Diphone synthesis 

Diphone synthesis uses a minimal speech database containing all the diphones (sound-to-sound transitions) occurring 
in a language. The number of diphones depends on the phonotactics of the language: for example, Spanish has about 
800 diphones, and German about 2500. In diphone synthesis, only one example of each diphone is contained in the 
speech database. At runtime, the target prosody of a sentence is superimposed on these minimal units by means of 
digital signal processing techniques such as linear predictive coding, PSOLA or MBROLA. Diphone synthesis
suffers from the sonic glitches of concatenative synthesis and the robotic-sounding nature of formant synthesis, and
has few of the advantages of either approach other than small size. As such, its use in commercial applications is 
declining, although it continues to be used in research because there are a number of freely available software 
implementations. 
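
As a rough illustration of what a diphone database is indexed by, the sketch below converts a phone sequence into the
sequence of sound-to-sound transitions that spans it. The silence symbol "_" and the "a-b" naming of diphones are
assumptions made for this example, not a convention taken from any particular system.

```python
def to_diphones(phones, silence="_"):
    """Return the diphone names covering a phone sequence, padded with silence at both ends."""
    padded = [silence] + list(phones) + [silence]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


print(to_diphones(["k", "a", "t"]))
```

Each name would select the single stored recording of that transition, onto which the target prosody is then imposed.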

Domain-specific synthesis 

Domain-specific synthesis concatenates prerecorded words and phrases to create complete utterances. It is used in 
applications where the variety of texts the system will output is limited to a particular domain, like transit schedule 
announcements or weather reports. The technology is very simple to implement, and has been in commercial use
for a long time, in devices like talking clocks and calculators. The level of naturalness of these systems can be very 
high because the variety of sentence types is limited, and they closely match the prosody and intonation of the 
original recordings. 
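
A minimal sketch of this approach, assuming a hypothetical transit-announcement system: fixed carrier phrases and a
closed set of slot fillers are mapped to prerecorded clips, and an utterance is simply the ordered list of clips. The
clip inventory, file names, and the announcement function are invented for illustration.

```python
# Hypothetical inventory of prerecorded clips: spoken text -> audio file name.
CLIPS = {
    "the next train to": "carrier_next_train.wav",
    "departs from platform": "carrier_platform.wav",
    "Springfield": "dest_springfield.wav",
    "Riverside": "dest_riverside.wav",
    "1": "num_1.wav",
    "2": "num_2.wav",
}


def announcement(destination, platform):
    """Return the ordered clip files for one announcement.

    Raises KeyError if a word or phrase was never recorded.
    """
    parts = ["the next train to", destination, "departs from platform", str(platform)]
    return [CLIPS[part] for part in parts]


print(announcement("Riverside", 2))
```

Anything outside the recorded inventory raises an error, which mirrors the restriction of such systems to their own
domain.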

Because these systems are limited by the words and phrases in their databases, they are not general-purpose and can 
only synthesize the combinations of words and phrases with which they have been preprogrammed. The blending of 
words within naturally spoken language, however, can still cause problems unless the many variations are taken into
account. For example, in non-rhotic dialects of English the "r" in words like "clear" /ˈklɪə/ is usually only
pronounced when the following word has a vowel as its first letter (e.g. "clear out" is realized as /ˌklɪəɾˈʌʊt/).
Likewise in French, many final consonants become no longer silent if followed by a word that begins with a vowel, 
an effect called liaison. This alternation cannot be reproduced by a simple word-concatenation system, which would 
require additional complexity to be context-sensitive. 

Formant synthesis 

Formant synthesis does not use human speech samples at runtime. Instead, the synthesized speech output is created 
using additive synthesis and an acoustic model (physical modelling synthesis). Parameters such as fundamental 
frequency, voicing, and noise levels are varied over time to create a waveform of artificial speech. This method is 
sometimes called rules-based synthesis; however, many concatenative systems also have rules-based components.
Many systems based on formant synthesis technology generate artificial, robotic-sounding speech that would never 
be mistaken for human speech. However, maximum naturalness is not always the goal of a speech synthesis system, 
and formant synthesis systems have advantages over concatenative systems. Formant-synthesized speech can be 
reliably intelligible, even at very high speeds, avoiding the acoustic glitches that commonly plague concatenative 
systems. High-speed synthesized speech is used by the visually impaired to quickly navigate computers using a 
screen reader. Formant synthesizers are usually smaller programs than concatenative systems because they do not 
have a database of speech samples. They can therefore be used in embedded systems, where memory and 
microprocessor power are especially limited. Because formant-based systems have complete control of all aspects of 
the output speech, a wide variety of prosodies and intonations can be output, conveying not just questions and 
statements, but a variety of emotions and tones of voice. 
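
The idea of generating speech from parameters alone, with no recorded samples, can be sketched with a simple
source-filter arrangement: an impulse train at the fundamental frequency is passed through a cascade of second-order
digital resonators, one per formant. This is one common formulation and is an assumption here, not necessarily the
method of any system named in this article. The function names, the fixed 80 Hz bandwidth, and the default 120 Hz
fundamental are likewise illustrative choices.

```python
import math


def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Second-order digital resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # normalizes the gain to 1 at 0 Hz
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def impulse_train(f0_hz, duration_s, sample_rate):
    """Crude voicing source: one unit impulse per pitch period."""
    period = int(round(sample_rate / f0_hz))
    n = int(round(duration_s * sample_rate))
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def synthesize_vowel(formants_hz, f0_hz=120.0, duration_s=0.5,
                     sample_rate=16000, bandwidth_hz=80.0):
    """Return samples in [-1, 1] for a steady vowel with the given formant frequencies."""
    signal = impulse_train(f0_hz, duration_s, sample_rate)
    for formant in formants_hz:
        signal = resonator(signal, formant, bandwidth_hz, sample_rate)
    peak = max(abs(s) for s in signal)
    return [s / peak for s in signal]


samples = synthesize_vowel((320.0, 2500.0))  # formant values in the range of an [i]-like vowel
print(len(samples))
```

Varying the formant frequencies, the fundamental, and the source over time, rather than holding them fixed as here, is
what turns such a steady tone into artificial speech.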

Examples of non-real-time but highly accurate intonation control in formant synthesis include the work done in the
late 1970s for the Texas Instruments toy Speak & Spell, and in the early 1980s Sega arcade machines[28] and in
many Atari, Inc. arcade games[29] using the TMS5220 LPC Chips. Creating proper intonation for these projects was
painstaking, and the results have yet to be matched by real-time text-to-speech interfaces.[30]

Articulatory synthesis

Articulatory synthesis refers to computational techniques for synthesizing speech based on models of the human 
vocal tract and the articulation processes occurring there. The first articulatory synthesizer regularly used for 
laboratory experiments was developed at Haskins Laboratories in the mid-1970s by Philip Rubin, Tom Baer, and 
Paul Mermelstein. This synthesizer, known as ASY, was based on vocal tract models developed at Bell Laboratories 
in the 1960s and 1970s by Paul Mermelstein, Cecil Coker, and colleagues. 

Until recently, articulatory synthesis models have not been incorporated into commercial speech synthesis systems. 
A notable exception is the NeXT-based system originally developed and marketed by Trillium Sound Research, a 
spin-off company of the University of Calgary, where much of the original research was conducted. Following the 
demise of the various incarnations of NeXT (started by Steve Jobs in the late 1980s and merged with Apple 
Computer in 1997), the Trillium software was published under the GNU General Public License, with work 
continuing as gnuspeech. The system, first marketed in 1994, provides full articulatory-based text-to-speech 
conversion using a waveguide or transmission-line analog of the human oral and nasal tracts controlled by Carre's 
"distinctive region model". 

HMM-based synthesis 

HMM-based synthesis is a synthesis method based on hidden Markov models, also called Statistical Parametric 
Synthesis. In this system, the frequency spectrum (vocal tract), fundamental frequency (vocal source), and duration 
(prosody) of speech are modeled simultaneously by HMMs. Speech waveforms are generated from HMMs
themselves based on the maximum likelihood criterion.[31]


Sinewave synthesis 

Sinewave synthesis is a technique for synthesizing speech by replacing the formants (main bands of energy) with 
pure tone whistles.
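
A minimal sketch of the idea, assuming the formant tracks are already available as one list of (frequency,
amplitude) pairs per frame: each formant is replaced by a single sinusoid whose frequency follows the track. The
function name, the 10 ms frame length, and the example track values are assumptions.

```python
import math


def sinewave_speech(formant_tracks, frame_s=0.01, sample_rate=16000):
    """Sum one sinusoid per formant, following per-frame (freq_hz, amplitude) pairs."""
    samples_per_frame = int(round(frame_s * sample_rate))
    phases = [0.0] * len(formant_tracks[0])  # running phase per tone keeps frame joins continuous
    out = []
    for frame in formant_tracks:
        for _ in range(samples_per_frame):
            value = 0.0
            for k, (freq_hz, amplitude) in enumerate(frame):
                phases[k] += 2.0 * math.pi * freq_hz / sample_rate
                value += amplitude * math.sin(phases[k])
            out.append(value)
    return out


# Ten frames of a steady two-tone pattern (illustrative values).
tones = sinewave_speech([[(320.0, 1.0), (2500.0, 0.5)]] * 10)
print(len(tones))
```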

Challenges 

Text normalization challenges 

The process of normalizing text is rarely straightforward. Texts are full of heteronyms, numbers, and abbreviations 
that all require expansion into a phonetic representation. There are many spellings in English which are pronounced 
differently based on context. For example, "My latest project is to learn how to better project my voice" contains two 
pronunciations of "project". 

Most text-to-speech (TTS) systems do not generate semantic representations of their input texts, as processes for 
doing so are not reliable, well understood, or computationally effective. As a result, various heuristic techniques are 
used to guess the proper way to disambiguate homographs, like examining neighboring words and using statistics 
about frequency of occurrence. 

Recently TTS systems have begun to use HMMs (discussed above) to generate "parts of speech" to aid in 
disambiguating homographs. This technique is quite successful for many cases such as whether "read" should be 
pronounced as "red" implying past tense, or as "reed" implying present tense. Typical error rates when using HMMs 
in this fashion are usually below five percent. These techniques also work well for most European languages, 
although access to required training corpora is frequently difficult in these languages. 
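
The sketch below is not an HMM tagger. It illustrates the much simpler neighboring-word heuristic mentioned earlier,
applied to the "read" example: a few preceding words are checked for cues to past or present tense. The cue lists,
the two-word window, and the fallback guess are all assumptions made for illustration.

```python
# Hypothetical cue words; a real system would learn such evidence from a corpus.
PAST_CUES = {"has", "have", "had", "was", "were", "been", "already"}
PRESENT_CUES = {"to", "will", "can", "must", "should", "please"}


def pronounce_read(words, index):
    """Guess 'red' (past) or 'reed' (present) for words[index] == 'read'."""
    window = words[max(0, index - 2):index]
    for previous in reversed(window):  # the nearest preceding word is checked first
        word = previous.lower()
        if word in PAST_CUES:
            return "red"
        if word in PRESENT_CUES:
            return "reed"
    return "reed"  # fallback guess when no cue is found


print(pronounce_read("She has read the book".split(), 2))
print(pronounce_read("I will read it".split(), 2))
```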

Deciding how to convert numbers is another problem that TTS systems have to address. It is a simple programming 
challenge to convert a number into words (at least in English), like "1325" becoming "one thousand three hundred 
twenty-five." However, numbers occur in many different contexts; "1325" may also be read as "one three two five",
"thirteen twenty-five" or "thirteen hundred and twenty five". A TTS system can often infer how to expand a number
based on surrounding words, numbers, and punctuation, and sometimes the system provides a way to specify the
context if it is ambiguous.[33] Roman numerals can also be read differently depending on context. For example,
"Henry VIII" reads as "Henry the Eighth", while "Chapter VIII" reads as "Chapter Eight".

Similarly, abbreviations can be ambiguous. For example, the abbreviation "in" for "inches" must be differentiated 
from the word "in", and the address "12 St John St." uses the same abbreviation for both "Saint" and "Street". TTS 
systems with intelligent front ends can make educated guesses about ambiguous abbreviations, while others provide 
the same result in all cases, resulting in nonsensical (and sometimes comical) outputs, such as "co-operation" being 
rendered as "company operation". 
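
The context-dependent expansions described in this section can be sketched as follows. The first part expands a
number in three of the ways mentioned for "1325", selected by a caller-supplied context hint; the second part expands
"St" by looking at the following word. The context labels, function names, and the capitalization rule are assumptions,
and the number code covers only 0 to 9999.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
        "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
        "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]


def two_digits(n):
    """Spell out 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def cardinal(n):
    """Spell out 0..9999 as a quantity."""
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    parts = []
    if thousands:
        parts.append(ONES[thousands] + " thousand")
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if rest or not parts:
        parts.append(two_digits(rest))
    return " ".join(parts)


def in_pairs(n):
    """Spell out 1000..9999 as two two-digit groups, as is common for years."""
    if n % 1000 == 0:
        return cardinal(n)
    high, low = divmod(n, 100)
    if low == 0:
        return two_digits(high) + " hundred"
    if low < 10:
        return two_digits(high) + " oh " + ONES[low]
    return two_digits(high) + " " + two_digits(low)


def expand_number(text, context):
    """context is a hint: 'code' (digit by digit), 'year' (pairs), or anything else (quantity)."""
    if context == "code":
        return " ".join(ONES[int(ch)] for ch in text)
    if context == "year":
        return in_pairs(int(text))
    return cardinal(int(text))


def expand_st(text):
    """Expand 'St' or 'St.' to 'Saint' before a capitalized word, otherwise to 'Street'."""
    tokens = text.split()
    out = []
    for i, token in enumerate(tokens):
        if token.rstrip(".") == "St":
            following = tokens[i + 1] if i + 1 < len(tokens) else ""
            out.append("Saint" if following[:1].isupper() else "Street")
        else:
            out.append(token)
    return " ".join(out)


print(expand_number("1325", "quantity"))
print(expand_number("1325", "year"))
print(expand_st("12 St John St."))
```

The capitalization rule is deliberately naive: it would misread a street name followed by another capitalized word,
which is the kind of case that motivates more intelligent front ends.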

Text-to-phoneme challenges 

Speech synthesis systems use two basic approaches to determine the pronunciation of a word based on its spelling, a 
process which is often called text-to-phoneme or grapheme-to-phoneme conversion (phoneme is the term used by 
linguists to describe distinctive sounds in a language). The simplest approach to text-to-phoneme conversion is the 
dictionary-based approach, where a large dictionary containing all the words of a language and their correct 
pronunciations is stored by the program. Determining the correct pronunciation of each word is a matter of looking 
up each word in the dictionary and replacing the spelling with the pronunciation specified in the dictionary. The 
other approach is rule-based, in which pronunciation rules are applied to words to determine their pronunciations 
based on their spellings. This is similar to the "sounding out", or synthetic phonics, approach to learning reading. 

Each approach has advantages and drawbacks. The dictionary-based approach is quick and accurate, but completely 
fails if it is given a word which is not in its dictionary. As dictionary size grows, so too do the memory space
requirements of the synthesis system. On the other hand, the rule-based approach works on any input, but the
complexity of the rules grows substantially as the system takes into account irregular spellings or pronunciations. 
(Consider that the word "of" is very common in English, yet is the only word in which the letter "f" is pronounced 
[v].) As a result, nearly all speech synthesis systems use a combination of these approaches. 
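
A combined approach can be sketched as a dictionary lookup with a rule-based fallback. The tiny lexicon, the
one-letter-to-one-sound rules, and the ARPABET-like symbols below are all hypothetical and far too crude for real
use; they only show how the two approaches fit together and why "of" needs a dictionary entry.

```python
# Hypothetical exception dictionary: spelling -> phoneme symbols.
LEXICON = {
    "of": ["AH", "V"],
    "the": ["DH", "AH"],
}

# Naive letter-to-sound rules; letters without a rule are skipped.
LETTER_RULES = {
    "a": "AE", "b": "B", "d": "D", "e": "EH", "f": "F", "g": "G", "i": "IH",
    "k": "K", "l": "L", "m": "M", "n": "N", "o": "AA", "p": "P", "r": "R",
    "s": "S", "t": "T", "u": "AH", "v": "V", "z": "Z",
}


def rule_based(word):
    """Apply the letter rules to every letter of the word."""
    return [LETTER_RULES[ch] for ch in word.lower() if ch in LETTER_RULES]


def to_phonemes(word):
    """Use the dictionary when the word is listed, otherwise fall back to the rules."""
    key = word.lower()
    if key in LEXICON:
        return list(LEXICON[key])
    return rule_based(key)


print(to_phonemes("of"))    # dictionary entry, with "f" pronounced as V
print(rule_based("of"))     # the rules alone get this word wrong
print(to_phonemes("fan"))   # not in the dictionary, so the rules are used
```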

Languages with a phonemic orthography have a very regular writing system, and the prediction of the pronunciation 
of words based on their spellings is quite successful. Speech synthesis systems for such languages often use the 
rule-based method extensively, resorting to dictionaries only for those few words, like foreign names and 
borrowings, whose pronunciations are not obvious from their spellings. On the other hand, speech synthesis systems 
for languages like English, which have extremely irregular spelling systems, are more likely to rely on dictionaries, 
and to use rule-based methods only for unusual words, or words that aren't in their dictionaries. 

Evaluation challenges 

The consistent evaluation of speech synthesis systems may be difficult because of a lack of universally agreed 
objective evaluation criteria. Different organizations often use different speech data. The quality of speech synthesis 
systems also depends to a large degree on the quality of the production technique (which may involve analogue or 
digital recording) and on the facilities used to replay the speech. Evaluating speech synthesis systems has therefore 
often been compromised by differences between production techniques and replay facilities. 

Recently, however, some researchers have started to evaluate speech synthesis systems using a common speech 
dataset.[34]

Prosodies and emotional content

A study in the journal "Speech Communication" by Amy Drahota and colleagues at the University of Portsmouth, 
UK, reported that listeners to voice recordings could determine, at better than chance levels, whether or not the 
speaker was smiling. It was suggested that identification of the vocal features that signal emotional content
may be used to help make synthesized speech sound more natural.

Dedicated hardware 

• Votrax

  • SC-01A (analog formant)

  • SC-02 / SSI-263 / "Artic 263"

• General Instrument SP0256-AL2 (CTS256A-AL2)

• Magnevation SpeakJet (www.speechchips.com TTS256)

• National Semiconductor DT1050 Digitalker (Mozer - Forrest Mozer)

• Silicon Systems SSI 263 (analog formant)

• Texas Instruments LPC Speech Chips

  • TMS5110A

  • TMS5200

  • MSP50C6XX - Sold to Sensory, Inc. in 2001 [38]

• TextSpeak Embedded TTS-EMHD2 Modules - Standalone - World Languages

  • Released in 2011; 24-language core module system

Computer operating systems or outlets with speech synthesis 

Atari 

Arguably, the first speech system integrated into an operating system was that of the 1400XL/1450XL personal computers
designed by Atari, Inc. using the Votrax SC01 chip in 1983. The 1400XL/1450XL computers used a finite state
machine to enable World English Spelling text-to-speech synthesis. Unfortunately, the 1400XL/1450XL personal
computers never shipped in quantity.

The Atari ST computers were sold with "stspeech.tos" on floppy disk. 

Apple 

The first speech system integrated into an operating system that shipped in quantity was Apple Computer's
MacinTalk in 1984. The software was licensed from third-party developers Joseph Katz and Mark Barton (later,
SoftVoice, Inc.) and was featured during the 1984 introduction of the Macintosh computer. Since the 1980s,
Macintosh computers have offered text-to-speech capabilities through the MacinTalk software. In the early 1990s Apple
expanded its capabilities, offering system-wide text-to-speech support. With the introduction of faster
PowerPC-based computers, Apple included higher-quality voice sampling. Apple also introduced speech recognition
into its systems, which provided a fluid command set. More recently, Apple has added sample-based voices. Starting
as a curiosity, the speech system of Apple Macintosh has evolved into a fully supported program, PlainTalk, for
people with vision problems. VoiceOver was first featured in Mac OS X Tiger (10.4). During 10.4
(Tiger) and the first releases of 10.5 (Leopard) there was only one standard voice shipping with Mac OS X. Starting
with 10.6 (Snow Leopard), the user can choose from a wide range of voices. VoiceOver voices feature
realistic-sounding breaths between sentences, as well as improved clarity at high read rates over PlainTalk.
Mac OS X also includes say, a command-line application that converts text to audible speech. The AppleScript
Standard Additions include a say verb that allows a script to use any of the installed voices and to control the pitch,
speaking rate and modulation of the spoken text.

The Apple iOS operating system used on the iPhone, iPad and iPod Touch uses VoiceOver speech synthesis for
accessibility. Some third-party applications also provide speech synthesis to facilitate navigating, reading web
pages or translating text.

AmigaOS 

The second operating system with advanced speech synthesis capabilities was AmigaOS, introduced in 1985. The
voice synthesis was licensed by Commodore International from SoftVoice, Inc., which also developed the original
MacInTalk text-to-speech system. It featured a complete system of voice emulation, with both male and female
voices and "stress" indicator markers, made possible by advanced features of the Amiga hardware audio chipset.

It was divided into a narrator device and a translator library. Amiga Speak Handler featured a text-to-speech 
translator. AmigaOS considered speech synthesis a virtual hardware device, so the user could even redirect console 
output to it. Some Amiga programs, such as word processors, made extensive use of the speech system. 

Microsoft Windows 

Modern Windows desktop systems can use SAPI 4 and SAPI 5 components to support speech synthesis and speech
recognition. SAPI 4.0 was available as an optional add-on for Windows 95 and Windows 98. Windows 2000 added
Narrator, a text-to-speech utility for people with visual impairments. Third-party programs such as CoolSpeech,
Textaloud and Ultra Hal can perform various text-to-speech tasks such as reading text aloud from a specified
website, email account, text document, the Windows clipboard or the user's keyboard typing. Not all programs can
use speech synthesis directly. Some programs can use plug-ins, extensions or add-ons to read text aloud.
Third-party programs are available that can read text from the system clipboard.

Microsoft Speech Server is a server-based package for voice synthesis and recognition. It is designed for network 
use with web applications and call centers. 

Text-to-Speech (TTS) refers to the ability of computers to read text aloud. A TTS Engine converts written text to a 
phonemic representation, then converts the phonemic representation to waveforms that can be output as sound. TTS 
engines with different languages, dialects and specialized vocabularies are available through third-party 
publishers.

Android 

Version 1.6 of Android added support for speech synthesis (TTS).[44]

Internet 

Currently, there are a number of applications, plugins and gadgets that can read messages directly from an e-mail
client and web pages from a web browser or Google Toolbar, such as Text-to-voice, which is an add-on to Firefox.
Some specialized software can narrate RSS feeds. On one hand, online RSS narrators simplify information delivery
by allowing users to listen to their favourite news sources and to convert them to podcasts. On the other hand,
online RSS readers are available on almost any PC connected to the Internet. Users can download generated audio
files to portable devices, e.g. with the help of a podcast receiver, and listen to them while walking, jogging or
commuting to work.

A growing field in Internet-based TTS is web-based assistive technology, e.g. 'Browsealoud' from a UK company
and Readspeaker. It can deliver TTS functionality to anyone (for reasons of accessibility, convenience, entertainment
or information) with access to a web browser. The non-profit project Pediaphon was created in 2006 to provide a
similar web-based TTS interface to Wikipedia.[46]

Other work is being done in the context of the W3C through the W3C Audio Incubator Group [47] with the
involvement of the BBC and Google Inc.

Others

• Some e-book readers, such as the Amazon Kindle, Samsung E6, PocketBook eReader Pro, enTourage eDGe, and 
the Bebook Neo. 

• Some models of Texas Instruments home computers produced in 1979 and 1981 (Texas Instruments TI-99/4 and 
TI-99/4A) were capable of text-to-phoneme synthesis or reciting complete words and phrases (text-to-dictionary), 
using a very popular Speech Synthesizer peripheral. TI used a proprietary codec to embed complete spoken 
phrases into applications, primarily video games.[48]

• IBM's OS/2 Warp 4 included VoiceType, a precursor to IBM ViaVoice. 

• Systems that operate on free and open source software, including Linux, are various, and include
open-source programs such as the Festival Speech Synthesis System, which uses diphone-based synthesis (and can
use a limited number of MBROLA voices), and gnuspeech, which uses articulatory synthesis,[49] from the Free
Software Foundation.

• Companies which developed speech synthesis systems but which are no longer in this business include BeST 
Speech (bought by L&H), Eloquent Technology (bought by SpeechWorks), Lernout & Hauspie (bought by 
Nuance), SpeechWorks (bought by Nuance), Rhetorical Systems (bought by Nuance). 

• GPS Navigation units produced by Garmin, Magellan, TomTom and others use speech synthesis for automobile 
navigation. 

Speech synthesis markup languages 

A number of markup languages have been established for the rendition of text as speech in an XML-compliant 
format. The most recent is Speech Synthesis Markup Language (SSML), which became a W3C recommendation in 
2004. Older speech synthesis markup languages include Java Speech Markup Language (JSML) and SABLE. 
Although each of these was proposed as a standard, none of them has been widely adopted. 
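
As a rough illustration of what such markup looks like, the following sketch builds a small SSML document with
Python's standard XML library. The speak, emphasis and break elements and the namespace URI come from the SSML 1.0
recommendation; the helper name build_ssml, its parameters and the sample sentence are assumptions made for this
example only, and no particular synthesizer is implied.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"   # SSML 1.0 namespace
XML_NS = "http://www.w3.org/XML/1998/namespace"   # namespace behind xml:lang


def build_ssml(before: str, emphasized: str, after: str, pause: str = "300ms") -> str:
    """Return a small SSML document (hypothetical helper, for illustration)."""
    ET.register_namespace("", SSML_NS)             # serialize without a prefix
    speak = ET.Element(
        f"{{{SSML_NS}}}speak",
        {"version": "1.0", f"{{{XML_NS}}}lang": "en-US"},
    )
    speak.text = before
    # <emphasis> asks the synthesizer to stress the enclosed words.
    emphasis = ET.SubElement(speak, f"{{{SSML_NS}}}emphasis", {"level": "strong"})
    emphasis.text = emphasized
    # <break> inserts a pause; the text after it is stored as the element's tail.
    pause_el = ET.SubElement(speak, f"{{{SSML_NS}}}break", {"time": pause})
    pause_el.tail = after
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_ssml("Speech synthesis can be ", "marked up", " for the synthesizer."))
```

The resulting string would be handed to an SSML-aware engine; how a given engine renders emphasis or pauses is
left to that engine.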

Speech synthesis markup languages are distinguished from dialogue markup languages. VoiceXML, for example, 
includes tags related to speech recognition, dialogue management and touchtone dialing, in addition to text-to-speech 
markup. 

Applications 

Speech synthesis has long been a vital assistive technology tool and its application in this area is significant and 
widespread. It allows environmental barriers to be removed for people with a wide range of disabilities. The longest 
application has been in the use of screen readers for people with visual impairment, but text-to-speech systems are 
now commonly used by people with dyslexia and other reading difficulties as well as by pre-literate children. They 
are also frequently employed to aid those with severe speech impairment usually through a dedicated voice output 
communication aid. 

Speech synthesis techniques are also used in entertainment productions such as games and animations. In 2007,
Animo Limited announced the development of a software application package based on its speech synthesis software
FineSpeech, explicitly geared towards customers in the entertainment industries, able to generate narration and lines
of dialogue according to user specifications.[50] The application reached maturity in 2008, when NEC Biglobe
announced a web service that allows users to create phrases from the voices of Code Geass: Lelouch of the Rebellion
R2 characters.[51]

In recent years, text-to-speech for disability and handicapped communication aids has become widely deployed in
mass transit. Companies like TalkingSigns and TextSpeak Systems have pioneered solutions such as TTS for
Digital Signage for the Blind that work via standard speakers and also radio receivers (e.g. BART in the SF Bay
area). Text-to-speech is also finding new applications outside the disability market. For example, speech synthesis,
combined with speech recognition, allows for interaction with mobile devices via natural language processing
interfaces. Speech synthesis is also used for facilitating the creation of online presentations.


Human voice 


The human voice consists of sound made by 
a human being using the vocal folds for 
talking, singing, laughing, crying, 
screaming, etc. Its frequency ranges from 
about 60 to 7000 Hz. The human voice is 
specifically that part of human sound 
production in which the vocal folds (vocal 
cords) are the primary sound source. 

Generally speaking, the mechanism for
generating the human voice can be
subdivided into three parts: the lungs, the
vocal folds within the larynx, and the
articulators. The lungs (the pump) must
produce adequate airflow and air pressure to
vibrate the vocal folds (this air pressure is the
fuel of the voice). The vocal folds (vocal cords) are a vibrating valve that chops up the airflow from the lungs into
audible pulses that form the laryngeal sound source. The muscles of the larynx adjust the length and tension of the
vocal folds to 'fine tune' pitch and tone. The articulators (the parts of the vocal tract above the larynx consisting of
tongue, palate, cheek, lips, etc.) articulate and filter the sound emanating from the larynx and to some degree can
interact with the laryngeal airflow to strengthen it or weaken it as a sound source.

The vocal folds, in combination with the articulators, are capable of producing highly intricate arrays of
sound. The tone of voice may be modulated to suggest emotions such as anger, surprise, or happiness.[4][5]
Singers use the human voice as an instrument for creating music.

Voice types and the folds (cords) themselves

Adult men and women have different
sizes of vocal fold, reflecting the
male-female differences in larynx size.

Adult male voices are usually
lower-pitched and have larger folds.

The male vocal folds (which would be
measured vertically in the opposite
diagram) are between 17 mm and
25 mm in length. The female vocal
folds are between 12.5 mm and
17.5 mm in length.

As seen in the illustration, the folds are 
located just above the vertebrate 
trachea (the windpipe, which travels 
from the lungs). Food and drink do not 
pass through the cords but instead pass 
through the esophagus, an unlinked 
tube. Both tubes are separated by the epiglottis, a "flap" that covers the opening of the trachea while swallowing. 

The folds in both sexes are within the larynx. They are attached at the back (side nearest the spinal cord) to the 
arytenoid cartilages, and at the front (side under the chin) to the thyroid cartilage. They have no outer edge as they
blend into the side of the breathing tube (the illustration is out of date and does not show this well) while their inner 
edges or "margins" are free to vibrate (the hole). They have a three layer construction of an epithelium, vocal 
ligament, then muscle (vocalis muscle), which can shorten and bulge the folds. They are flat triangular bands and are 
pearly white in color. Above both sides of the vocal cord is the vestibular fold or false vocal cord, which has a small 
sac between its two folds (not illustrated). 

The difference in vocal fold size between men and women means that they have differently pitched voices.
Additionally, genetics causes variation within each sex, with men's and women's singing voices being
categorized into types. For example, among men, there are bass, baritone, tenor and countertenor (ranging from E2
to even F6), and among women, contralto, mezzo-soprano and soprano (ranging from F3 to C6). There are additional
categories for operatic voices; see voice type. This is not the only source of difference between male and female
voices. Men, generally speaking, have a larger vocal tract, which essentially gives the resultant voice a
lower-sounding timbre. This is mostly independent of the vocal folds themselves.

Voice modulation in spoken language 

Human spoken language makes use of the ability of almost all persons in a given society to dynamically modulate 
certain parameters of the laryngeal voice source in a consistent manner. The most important communicative, or 
phonetic, parameters are the voice pitch (determined by the vibratory frequency of the vocal folds) and the degree of 
separation of the vocal folds, referred to as vocal fold adduction (coming together) or abduction (separating).

The ability to vary the ab/adduction of the vocal folds quickly has a strong genetic component, since vocal fold
adduction has a life-preserving function in keeping food from passing into the lungs, in addition to the covering
action of the epiglottis. Consequently, the muscles that control this action are among the fastest in the body.
Children can learn to use this action consistently during speech at an early age, as they learn to speak the difference
between utterances such as "apa" (having an abductory-adductory gesture for the p) and "aba" (having no
abductory-adductory gesture). Surprisingly enough, they can learn to do this well before the age of two by
listening only to the voices of adults around them who have voices much different from their own, and even though
the laryngeal movements causing these phonetic differentiations are deep in the throat and not visible to them.

If an abductory movement or adductory movement is strong enough, the vibrations of the vocal folds will stop (or
not start). If the gesture is abductory and is part of a speech sound, the sound will be called voiceless. However,
voiceless speech sounds are sometimes better identified as containing an abductory gesture, even if the gesture was
not strong enough to stop the vocal folds from vibrating. This anomalous feature of voiceless speech sounds is better
understood if it is realized that it is the change in the spectral qualities of the voice as abduction proceeds that is the
primary acoustic attribute that the listener attends to when identifying a voiceless speech sound, and not simply the
presence or absence of voice (periodic energy).

An adductory gesture is also identified by the change in voice spectral energy it produces. Thus, a speech sound
having an adductory gesture may be referred to as a "glottal stop" even if the vocal fold vibrations do not entirely
stop (see [9] for an example illustrating this, obtained by using the inverse filtering [10] of oral airflow).

Other aspects of the voice, such as variations in the regularity of vibration, are also used for communication, and are 
important for the trained voice user to master, but are more rarely used in the formal phonetic code of a spoken 
language. 

Physiology and vocal timbre 

The sound of each individual's voice is unique not only because of the actual shape and size of an
individual's vocal cords but also due to the size and shape of the rest of that person's body, especially the vocal tract,
and the manner in which the speech sounds are habitually formed and articulated. (It is this latter aspect of the sound
of the voice that can be mimicked by skilled performers.) Humans have vocal folds that can loosen, tighten, or
change their thickness, and over which breath can be transferred at varying pressures. The shape of chest and neck,
the position of the tongue, and the tightness of otherwise unrelated muscles can be altered. Any one of these actions
results in a change in pitch, volume, timbre, or tone of the sound produced. Sound also resonates within different
parts of the body, and an individual's size and bone structure can somewhat affect the sound that the individual
produces.

Singers can also learn to project sound in certain ways so that it resonates better within their vocal tract. This is
known as vocal resonation. Another major influence on vocal sound and production is the function of the larynx,
which people can manipulate in different ways to produce different sounds. These different kinds of laryngeal
function are described as different kinds of vocal registers. The primary method for singers to accomplish this is
through the use of the singer's formant, which has been shown to be a resonance added to the normal resonances
of the vocal tract above the frequency range of most instruments and so enables the singer's voice to carry better over
musical accompaniment.

Vocal registration 

Vocal registration refers to the system of vocal registers within the human voice. A register in the human voice is a
particular series of tones, produced in the same vibratory pattern of the vocal folds, and possessing the same quality.
Registers originate in laryngeal functioning. They occur because the vocal folds are capable of producing several
different vibratory patterns. Each of these vibratory patterns appears within a particular range of pitches
and produces certain characteristic sounds. The term register can be somewhat confusing as it encompasses
several aspects of the human voice. The term register can be used to refer to any of the following:

• A particular part of the vocal range such as the upper, middle, or lower registers. 

• A resonance area such as chest voice or head voice. 

• A phonatory process. 

• A certain vocal timbre.

• A region of the voice that is defined or delimited by vocal breaks.

• A subset of a language used for a particular purpose or in a particular social setting. 

In linguistics, a register language is a language that combines tone and vowel phonation into a single phonological 
system. 

Within speech pathology the term vocal register has three constituent elements: a certain vibratory pattern of the
vocal folds, a certain series of pitches, and a certain type of sound. Speech pathologists identify four vocal registers
based on the physiology of laryngeal function: the vocal fry register, the modal register, the falsetto register, and
the whistle register. This view is also adopted by many vocal pedagogists.

Vocal resonation 

Vocal resonation is the process by which the basic product of phonation is enhanced in timbre and/or intensity by
the air-filled cavities through which it passes on its way to the outside air. Various terms related to the resonation
process include amplification, enrichment, enlargement, improvement, intensification, and prolongation, although in
strictly scientific usage acoustic authorities would question most of them. The main point to be drawn from these
terms by a singer or speaker is that the end result of resonation is, or should be, to make a better sound.[16] There are
seven areas that may be listed as possible vocal resonators. In sequence from the lowest within the body to the
highest, these areas are the chest, the tracheal tree, the larynx itself, the pharynx, the oral cavity, the nasal cavity, and
the sinuses.[17]

Influences of the human voice 

The twelve-tone musical scale, upon which some of the music in the world is based, may have its roots in the sound 
of the human voice during the course of evolution, according to a study published by the New Scientist. Analysis of 
recorded speech samples found peaks in acoustic energy that mirrored the distances between notes in the twelve-tone 
scale.[18]

Voice disorders 

There are many disorders that affect the human voice; these include speech impediments, and growths and lesions on
the vocal folds. Talking improperly for long periods of time causes vocal loading, which is stress inflicted on the
speech organs. When vocal injury is done, often an ENT specialist may be able to help, but the best treatment is the
prevention of injuries through good vocal production.[19] Voice therapy is generally delivered by a speech-language
pathologist.

Vocal Cord Nodules and Polyps 

Vocal nodules are caused over time by repeated abuse of the vocal cords which results in soft, swollen spots on each 
vocal cord. These spots develop into harder, callus-like growths called nodules. The longer the abuse occurs the
larger and stiffer the nodules will become. Most polyps are larger than nodules and may be called by other names, 
such as polypoid degeneration or Reinke's edema. Polyps are caused by a single occurrence and may require surgical 
removal. Irritation after the removal may then lead to nodules if additional irritation persists. Speech-language 
therapy teaches the patient how to eliminate the irritations permanently through habit changes and vocal hygiene. 
Hoarseness or breathiness that lasts for more than two weeks is a common symptom of an underlying voice disorder 
such as nodes or polyps and should be investigated medically. 


Linear predictive coding 


Linear predictive coding (LPC) is a tool used mostly in audio signal processing and speech processing for
representing the spectral envelope of a digital signal of speech in compressed form, using the information of a linear
predictive model.[1] It is one of the most powerful speech analysis techniques and one of the most useful methods
for encoding good-quality speech at a low bit rate, and it provides extremely accurate estimates of speech parameters.

Overview 

LPC starts with the assumption that a speech signal is produced by a buzzer at the end of a tube (voiced sounds), 
with occasional added hissing and popping sounds (sibilants and plosive sounds). Although apparently crude, this 
model is actually a close approximation of the reality of speech production. The glottis (the space between the vocal 
folds) produces the buzz, which is characterized by its intensity (loudness) and frequency (pitch). The vocal tract (the 
throat and mouth) forms the tube, which is characterized by its resonances, which give rise to formants, or enhanced 
frequency bands in the sound produced. Hisses and pops are generated by the action of the tongue, lips and throat 
during sibilants and plosives. 

LPC analyzes the speech signal by estimating the formants, removing their effects from the speech signal, and 
estimating the intensity and frequency of the remaining buzz. The process of removing the formants is called inverse 
filtering, and the remaining signal after the subtraction of the filtered modeled signal is called the residue. 

The numbers which describe the intensity and frequency of the buzz, the formants, and the residue signal, can be 
stored or transmitted somewhere else. LPC synthesizes the speech signal by reversing the process: use the buzz 
parameters and the residue to create a source signal, use the formants to create a filter (which represents the tube), 
and run the source through the filter, resulting in speech. 

Because speech signals vary with time, this process is done on short chunks of the speech signal, which are called 
frames; generally 30 to 50 frames per second give intelligible speech with good compression. 
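
The analysis and resynthesis steps described above can be sketched for a single frame as follows. This is a
minimal illustration, not a production coder: it assumes the autocorrelation method with the Levinson-Durbin
recursion as the way of estimating the filter (the text above does not name a method), it omits windowing, pitch
estimation and quantization, and every function name is hypothetical. The test frame is a synthetic signal with
two known predictor coefficients rather than real speech.

```python
import random


def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of one frame (no windowing, for brevity)."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Estimate predictor coefficients a_1..a_p, where x[n] is predicted
    as sum_k a_k * x[n-k]. Returns (coefficients, remaining error energy)."""
    a = [0.0] * (order + 1)              # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:                   # silent or degenerate frame
            break
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:], err


def predict(history, a, n):
    """Linear prediction of sample n from the samples before it."""
    return sum(a[k] * history[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0)


def inverse_filter(x, a):
    """Remove the predictable (formant) part; what is left is the residue."""
    return [x[n] - predict(x, a, n) for n in range(len(x))]


def synthesize(residue, a):
    """Run the residue back through the all-pole filter to rebuild the frame."""
    y = []
    for n, e in enumerate(residue):
        y.append(e + predict(y, a, n))
    return y


def make_test_frame(n=4000, seed=0):
    """Synthetic stand-in for a speech frame: noise shaped by a known filter."""
    rng = random.Random(seed)
    x = []
    for i in range(n):
        x1 = x[i - 1] if i >= 1 else 0.0
        x2 = x[i - 2] if i >= 2 else 0.0
        x.append(1.3 * x1 - 0.8 * x2 + rng.gauss(0.0, 1.0))
    return x


if __name__ == "__main__":
    frame = make_test_frame()
    coeffs, _ = levinson_durbin(autocorrelation(frame, 2), 2)
    residue = inverse_filter(frame, coeffs)
    rebuilt = synthesize(residue, coeffs)
    print("estimated coefficients:", coeffs)
    print("max reconstruction error:", max(abs(u - v) for u, v in zip(frame, rebuilt)))
```

A real coder would repeat this per frame and transmit a compact description of the residue (for example pitch and
gain) instead of the residue itself, which is where the compression comes from.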

Early history of LPC 

According to Robert M. Gray of Stanford University, the first ideas leading to LPC started in 1966 when S. Saito and
F. Itakura of NTT described an approach to automatic phoneme discrimination that involved the first maximum
likelihood approach to speech coding. In 1967, John Burg outlined the maximum entropy approach. In 1969 Itakura
and Saito introduced partial correlation, in May Glen Culler proposed realtime speech encoding, and B. S. Atal
presented an LPC speech coder at the Annual Meeting of the Acoustical Society of America. In 1971 realtime LPC
using 16-bit LPC hardware was demonstrated by Philco-Ford; four units were sold.

In 1972 Bob Kahn of ARPA, with Jim Forgie (Lincoln Laboratory, LL) and Dave Walden (BBN Technologies),
started the first developments in packetized speech, which would eventually lead to Voice over IP technology. In
1973, according to Lincoln Laboratory informal history, the first realtime 2400 bit/s LPC was implemented by Ed
Hofstetter. In 1974 the first realtime two-way LPC packet speech communication was accomplished over the
ARPANET at 3500 bit/s between Culler-Harrison and Lincoln Laboratories. In 1976 the first LPC conference took
place over the ARPANET using the Network Voice Protocol, between Culler-Harrison, ISI, SRI, and LL at 3500
bit/s. Finally, in 1978, Vishwanath et al. of BBN developed the first variable-rate LPC algorithm.

LPC coefficient representations

LPC is frequently used for transmitting spectral envelope information, and as such it has to be tolerant of 
transmission errors. Transmission of the filter coefficients directly (see linear prediction for definition of 
coefficients) is undesirable, since they are very sensitive to errors. In other words, a very small error can distort the 
whole spectrum, or worse, a small error might make the prediction filter unstable. 

There are more advanced representations such as Log Area Ratios (LAR), line spectral pairs (LSP) decomposition 
and reflection coefficients. Of these, especially LSP decomposition has gained popularity, since it ensures stability of 
the predictor, and spectral errors are local for small coefficient deviations. 
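
To make the stability argument concrete, the sketch below converts predictor coefficients to reflection
coefficients with the step-down recursion and back with the step-up recursion, and maps reflection coefficients to
log area ratios. It assumes the prediction convention x[n] ≈ a_1 x[n-1] + ... + a_p x[n-p] and uses the fact that
the synthesis filter is stable exactly when every reflection coefficient has magnitude below 1. The sign
convention chosen for the log area ratio is one of several in use, and all function names are hypothetical.

```python
import math


def reflection_coefficients(a):
    """Step-down recursion: predictor coefficients [a_1..a_p] -> [k_1..k_p].
    Returns None as soon as some |k_i| >= 1, i.e. the filter is not stable."""
    cur = list(a)
    ks = []
    while cur:
        k = cur[-1]                      # highest-order coefficient is k_i
        if abs(k) >= 1.0:
            return None
        ks.append(k)
        i = len(cur)
        cur = [(cur[j] + k * cur[i - 2 - j]) / (1.0 - k * k) for j in range(i - 1)]
    ks.reverse()
    return ks


def predictor_from_reflection(ks):
    """Step-up recursion: [k_1..k_p] -> predictor coefficients [a_1..a_p]."""
    a = []
    for k in ks:
        a = [a[j] - k * a[len(a) - 1 - j] for j in range(len(a))] + [k]
    return a


def log_area_ratios(ks):
    """One common convention: LAR = ln((1 + k) / (1 - k))."""
    return [math.log((1.0 + k) / (1.0 - k)) for k in ks]


def reflection_from_lar(lars):
    """Inverse of log_area_ratios; any real LAR maps back to |k| < 1."""
    return [math.tanh(g / 2.0) for g in lars]


if __name__ == "__main__":
    # A small error in a directly transmitted coefficient can destabilize the filter.
    print(reflection_coefficients([1.9, -0.98]) is not None)   # stable
    print(reflection_coefficients([1.9, -1.01]) is not None)   # no longer stable

    # Coarsely rounded log area ratios still decode to a stable predictor.
    ks = reflection_coefficients([1.3, -0.8])
    coarse = [round(g) for g in log_area_ratios(ks)]
    print(predictor_from_reflection(reflection_from_lar(coarse)))
```

The design point is that quantization or channel errors applied to log area ratios cannot push a reflection
coefficient outside the stable range, whereas the same errors applied to the raw predictor coefficients can.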

Applications 

LPC is generally used for speech analysis and resynthesis. It is used as a form of voice compression by phone 
companies, for example in the GSM standard. It is also used for secure wireless, where voice must be digitized, 
encrypted and sent over a narrow voice channel; an early example of this is the US government's Navajo I. 

LPC synthesis can be used to construct vocoders where musical instruments are used as excitation signal to the 
time-varying filter estimated from a singer's speech. This is somewhat popular in electronic music. Paul Lansky
made the well-known computer music piece notjustmoreidlechatter using linear predictive coding.[2] A 10th-order
LPC was used in the popular 1980s Speak & Spell educational toy.

Waveform ROM in some digital sample-based music synthesizers made by Yamaha Corporation may be compressed 
using the LPC algorithm. 

LPC predictors are used in Shorten, MPEG-4 ALS, FLAC, and other lossless audio codecs. 

Notes 

[1] Deng, Li; Douglas O'Shaughnessy (2003). Speech processing: a dynamic and optimization-oriented approach. Marcel Dekker. pp. 41-48. 
ISBN 0-8247-4040-8. 

[2] http://www.music.princeton.edu/~paul/liner_notes/morethanidlechatter.html


Vocoder


A vocoder (/ˈvoʊkoʊdər/, short for voice encoder) is an analysis/synthesis system, used to reproduce human
speech. In the encoder, the input is passed through a multiband filter, each band is passed through an envelope
follower, and the control signals from the envelope followers are communicated to the decoder. The decoder applies
these (amplitude) control signals to corresponding filters in the (re)synthesizer.

It was originally developed as a speech coder for telecommunications applications in the 1930s, the idea being to 
code speech for transmission. Its primary use in this fashion is for secure radio communication, where voice has to 
be encrypted and then transmitted. The advantage of this method of "encryption" is that no 'signal' is sent, but rather 
envelopes of the bandpass filters. The receiving unit needs to be set up in the same channel configuration to 
resynthesize a version of the original signal spectrum. The vocoder as both hardware and software has also been used 
extensively as an electronic musical instrument. 

Whereas the vocoder analyzes speech, transforms it into electronically transmitted information, and recreates it, the
Voder (from Voice Operating Demonstrator) generates synthesized speech by means of a console with fifteen
touch-sensitive keys and a pedal. It basically consists of the "second half" of the vocoder, but with manual filter
controls, and needs a highly trained operator.


Vocoder theory 

The human voice consists of sounds 
generated by the opening and closing 
of the glottis by the vocal cords, which 
produces a periodic waveform with 
many harmonics. This basic sound is 
then filtered by the nose and throat (a 
complicated resonant piping system) to 
produce differences in harmonic 
content (formants) in a controlled way, 
creating the wide variety of sounds 
used in speech. There is another set of 
sounds, known as the unvoiced and 
plosive sounds, which are created or 
modified by the mouth in different 
fashions. 

The vocoder examines speech by measuring how its spectral characteristics change over time. This results in a series 
of numbers representing these modified frequencies at any particular time as the user speaks. In simple terms, the 
signal is split into a number of frequency bands (the larger this number, the more accurate the analysis) and the level 
of signal present at each frequency band gives the instantaneous representation of the spectral energy content. Thus, 
the vocoder dramatically reduces the amount of information needed to store speech, from a complete recording to a 
series of numbers. To recreate speech, the vocoder simply reverses the process, processing a broadband noise source 
by passing it through a stage that filters the frequency content based on the originally recorded series of numbers. 
Information about the instantaneous frequency (as distinct from spectral characteristic) of the original voice signal is 
discarded; it wasn't important to preserve this for the purposes of the vocoder's original use as an encryption aid, and 
it is this "dehumanizing" quality of the vocoding process that has made it useful in creating special voice effects in 
popular music and audio entertainment.

Since the vocoder process sends only the parameters of the vocal model over the communication link, instead of a
point by point recreation of the waveform, it allows a significant reduction in the bandwidth required to transmit 
speech. 


History 


Analog vocoders typically analyze an incoming signal 
by splitting the signal into a number of tuned frequency 
bands or ranges. A modulator and carrier signal are sent 
through a series of these tuned band pass filters. In the 
example of a typical robot voice the modulator is a 
microphone and the carrier is noise or a sawtooth 
waveform. There are usually between 8 and 20 bands. 

The amplitude of the modulator for each of the 
individual analysis bands generates a voltage that is 
used to control amplifiers for each of the corresponding 
carrier bands. The result is that frequency components 
of the modulating signal are mapped onto the carrier 
signal as discrete amplitude changes in each of the 
frequency bands. 
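
The band-splitting scheme just described can be sketched in software as below. This is an illustrative digital
analogue of the analog design, under several assumptions that are not in the text: second-order band-pass filters
stand in for the tuned filters, a rectifier followed by a one-pole low-pass stands in for the control-voltage
circuit, only four bands are used for brevity (the text mentions 8 to 20), and the centre frequencies, Q and
function names are arbitrary choices for the example.

```python
import math


def bandpass(signal, center_hz, fs, q=4.0):
    """Second-order band-pass filter with unity gain at the centre frequency."""
    w0 = 2.0 * math.pi * center_hz / fs
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b0, b2 = alpha / a0, -alpha / a0
    a1, a2 = -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0
    out = []
    x1 = x2 = y1 = y2 = 0.0
    for x in signal:
        y = b0 * x + b2 * x2 - a1 * y1 - a2 * y2
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out


def envelope(signal, fs, cutoff_hz=30.0):
    """Envelope follower: rectify, then smooth with a one-pole low-pass."""
    c = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / fs)
    env, out = 0.0, []
    for x in signal:
        env += c * (abs(x) - env)
        out.append(env)
    return out


def channel_vocoder(modulator, carrier, fs, centers=(300.0, 600.0, 1200.0, 2400.0)):
    """Impose the per-band amplitude envelope of the modulator on the carrier."""
    n = min(len(modulator), len(carrier))
    out = [0.0] * n
    for fc in centers:
        level = envelope(bandpass(modulator[:n], fc, fs), fs)   # analysis band
        band = bandpass(carrier[:n], fc, fs)                    # matching carrier band
        for i in range(n):
            out[i] += level[i] * band[i]                        # "voltage-controlled amplifier"
    return out


def sawtooth(freq_hz, fs, n):
    """Simple sawtooth carrier in the range [-1, 1)."""
    return [2.0 * ((i * freq_hz / fs) % 1.0) - 1.0 for i in range(n)]


if __name__ == "__main__":
    fs, n = 8000, 2000
    voice_like = [math.sin(2.0 * math.pi * 600.0 * i / fs) for i in range(n)]
    result = channel_vocoder(voice_like, sawtooth(110.0, fs, n), fs)
    print(len(result), max(abs(v) for v in result))
```

With a recorded voice as the modulator, the output keeps the carrier's pitch while following the band energies of
the speech, which is the "robot voice" effect mentioned above.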

Often there is an unvoiced band or sibilance channel. This is for frequencies outside of the analysis bands for
typical speech but still important in speech. Examples are words that start with the letters s, f, ch or any other sibilant
sound. These can be mixed with the carrier output to increase clarity. The result is recognizable speech, although
somewhat "mechanical" sounding. Vocoders also often include a second system for generating unvoiced sounds,
using a noise generator instead of the fundamental frequency.





[Figure: SIGSALY (1943-1946) speech encipherment system]

[Figure: HY-2 Vocoder (designed in 1961), the last generation of channel vocoder in the US]

The first experiments with a vocoder were conducted in 1928 by Bell Labs engineer Homer Dudley, who was 
granted a patent for it on March 21, 1939.[4] The Voder (Voice Operating Demonstrator) was introduced to the 
public at the AT&T building at the 1939-1940 New York World's Fair. The Voder consisted of a series of 
manually controlled oscillators, filters, and a noise source. The filters were controlled by a set of keys and a foot 
pedal to convert the hisses and tones into vowels, consonants, and inflections. This was a complex machine to 
operate, but with a skilled operator it could produce recognizable speech. 


[2][5] 


Dudley's vocoder was used in the SIGSALY system, which was built by Bell Labs engineers in 1943. SIGSALY was 
used for encrypted high-level voice communications during World War II. Later work in this field has been 
conducted by James Flanagan. 

Vocoder applications 

• Terminal equipment for Digital Mobile Radio (DMR) based systems. 

• Digital Trunking 

• DMR TDMA 

• Digital Voice Scrambling and Encryption 

• Digital WLL 

• Voice Storage and Playback Systems 

• Messaging Systems 

• VoIP Systems 

• Voice Pagers 

• Regenerative Digital Voice Repeaters 

Modern vocoder implementations 

Even with the need to record several frequencies, and the additional unvoiced sounds, the compression of the 
vocoder system is impressive. Standard speech-recording systems capture frequencies from about 500 Hz to 
3400 Hz, where most of the frequencies used in speech lie, typically using a sampling rate of 8 kHz (slightly greater 
than the Nyquist rate). The sampling resolution is typically 12 or more bits per sample (16 is 
standard), for a final data rate in the range of 96-128 kbit/s. However, a good vocoder can provide a reasonably good 
simulation of voice with as little as 2.4 kbit/s of data. 
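
The data rates quoted above follow directly from the sampling rate and the sample resolution. The short calculation below reproduces them and the resulting compression ratio; the function and variable names are illustrative only.

```python
def pcm_bitrate_kbps(sample_rate_hz, bits_per_sample):
    """Uncompressed single-channel PCM data rate in kbit/s."""
    return sample_rate_hz * bits_per_sample / 1000.0

low = pcm_bitrate_kbps(8000, 12)    # 12-bit samples at 8 kHz
high = pcm_bitrate_kbps(8000, 16)   # 16-bit samples at 8 kHz
vocoder_kbps = 2.4                  # rate quoted in the text for a good vocoder

print(low, high)                        # 96.0 128.0
print(round(high / vocoder_kbps, 1))    # 53.3
```

In other words, a 2.4 kbit/s vocoder carries speech in roughly one fiftieth of the bits used by 16-bit PCM at the same sampling rate.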

'Toll Quality' voice coders, such as ITU G.729, are used in many telephone networks. G.729 in particular has a final 
data rate of 8 kbit/s with superb voice quality. G.723 achieves slightly worse quality at data rates of 5.3 kbit/s and 6.4 
kbit/s. Many voice systems use even lower data rates, but below 5 kbit/s voice quality begins to drop rapidly. 

Several vocoder systems are used in NSA encryption systems: 

• LPC-10, FIPS Pub 137, 2400 bit/s, which uses linear predictive coding 

• Code-excited linear prediction (CELP), 2400 and 4800 bit/s. Federal Standard 1016, used in STU-III 

• Continuously variable slope delta modulation (CVSD), 16 kbit/s, used in wide band encryptors such as the 
KY-57. 

• Mixed-excitation linear prediction (MELP), MIL STD 3005, 2400 bit/s, used in the Future Narrowband Digital 
Terminal FNBDT, NSA's 21st century secure telephone. 

• Adaptive Differential Pulse Code Modulation (ADPCM), former ITU-T G.721, 32 kbit/s used in STE secure 
telephone 

(ADPCM is not a proper vocoder but rather a waveform codec. ITU has gathered G.721 along with some other 
ADPCM codecs into G.726.) 

Vocoders are also currently used in psychophysics, linguistics, computational neuroscience and cochlear 
implant research. 

Modern vocoders that are used in communication equipment and in voice storage devices today are based on the 
following algorithms: 

• Algebraic code-excited linear prediction (ACELP 4.7 kbit/s - 24 kbit/s) 

• Mixed-excitation linear prediction (MELPe 2400, 1200 and 600 bit/s) 

• Multi-band excitation (AMBE 2000 bit/s - 9600 bit/s) 

• Sinusoidal-Pulsed Representation (SPR 300 bit/s - 4800 bit/s)[9] 

• Tri-wave excited linear prediction (TWELP 2400 - 3600 bit/s) 

Linear prediction-based vocoders 

Since the late 1970s, most non-musical vocoders have been implemented using linear prediction, whereby the target 
signal's spectral envelope (formant) is estimated by an all-pole IIR filter. In linear prediction coding, the all-pole 
filter replaces the bandpass filter bank of its predecessor and is used at the encoder to whiten the signal (i.e., flatten 
the spectrum) and again at the decoder to re-apply the spectral shape of the target speech signal. 

One advantage of this type of filtering is that the location of the linear predictor's spectral peaks is entirely 
determined by the target signal, and can be as precise as allowed by the time period to be filtered. This is in contrast 
with vocoders realized using fixed-width filter banks, where spectral peaks can generally only be determined to be 
within the scope of a given frequency band. LP filtering also has disadvantages in that signals with a large number of 
constituent frequencies may exceed the number of frequencies that can be represented by the linear prediction filter. 
This restriction is the primary reason that LP coding is almost always used in tandem with other methods in 
high-compression voice coders. 
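
To make the idea of estimating a spectral envelope with an all-pole filter concrete, the sketch below uses the autocorrelation method with the Levinson-Durbin recursion, one common way of computing linear prediction coefficients. The test signal (the impulse response of a single synthetic resonance at 500 Hz), the sampling rate and all names are assumptions for illustration; real speech coders use higher orders and windowed frames.

```python
import math

def autocorrelation(x, max_lag):
    return [sum(x[n] * x[n + k] for n in range(len(x) - k))
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Return (a, err) with A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k                  # remaining prediction error energy
    return a, err

# Synthetic "formant": a two-pole resonance at 500 Hz, sampled at 8 kHz.
FS = 8000.0
FORMANT_HZ = 500.0
RADIUS = 0.9
theta = 2 * math.pi * FORMANT_HZ / FS
p1, p2 = 2 * RADIUS * math.cos(theta), -RADIUS ** 2

x = []
for n in range(400):
    sample = 1.0 if n == 0 else 0.0         # single excitation impulse
    if n >= 1:
        sample += p1 * x[n - 1]
    if n >= 2:
        sample += p2 * x[n - 2]
    x.append(sample)

a, err = levinson_durbin(autocorrelation(x, 2), 2)

# The angle of the predictor's pole pair gives the estimated resonance.
est_hz = math.acos(-a[1] / (2 * math.sqrt(a[2]))) * FS / (2 * math.pi)

# Filtering the signal with A(z) "whitens" it: only the excitation remains.
residual = [x[n]
            + a[1] * (x[n - 1] if n >= 1 else 0.0)
            + a[2] * (x[n - 2] if n >= 2 else 0.0)
            for n in range(len(x))]
```

Here the estimated pole frequency lands on the resonance itself rather than on the centre of a fixed analysis band, which is the advantage described above. The residual is what an encoder would have left to describe after removing the spectral envelope.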

RALCWI vocoder 

Robust Advanced Low Complexity Waveform Interpolation (RALCWI) technology uses proprietary signal 
decomposition and parameter encoding methods to provide high voice quality at high compression ratios. The voice 
quality of RALCWI-class vocoders, as estimated by independent listeners, is similar to that provided by standard 
vocoders running at bit rates above 4000 bit/s. The Mean Opinion Score (MOS) of voice quality for this vocoder is 
about 3.5-3.6. This value was determined by a paired comparison method, performing listening tests of developed 
and standard voice vocoders. 

The RALCWI vocoder operates on a “frame-by-frame” basis. The 20 ms source voice frame consists of 160 samples 
of linear 16-bit PCM sampled at 8 kHz. The Voice Encoder performs voice analysis at the high time resolution (8 
times per frame) and forms a set of estimated parameters for each voice segment. All of the estimated parameters are 
quantized to produce 41-, 48- or 55-bit frames, using vector quantization (VQ) of different types. All of the vector 
quantizers were trained on a mixed multi-language voice base, which contains voice samples in both Eastern and 
Western languages. 
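
The frame figures above imply the coded bit rates directly. The arithmetic below derives them from the frame length and frame sizes given in the text; the variable names are illustrative, and the derived rates are a calculation from those figures, not values taken from a product specification.

```python
FRAME_MS = 20
SAMPLE_RATE_HZ = 8000
PCM_BITS_PER_SAMPLE = 16

samples_per_frame = SAMPLE_RATE_HZ * FRAME_MS // 1000            # 160 samples
pcm_bits_per_frame = samples_per_frame * PCM_BITS_PER_SAMPLE     # 2560 bits

# Coded frame size in bits -> bit rate in bit/s (50 frames per second).
rates = {bits: bits * 1000 // FRAME_MS for bits in (41, 48, 55)}
# Compression ratio relative to the 16-bit PCM input frame.
ratios = {bits: round(pcm_bits_per_frame / bits, 1) for bits in (41, 48, 55)}

print(rates)    # {41: 2050, 48: 2400, 55: 2750}
print(ratios)   # {41: 62.4, 48: 53.3, 55: 46.5}
```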

The Waveform-Interpolative (WI) vocoder was developed at AT&T Bell Laboratories around 1995 by W.B. Kleijn, and 
subsequently a low-complexity version was developed by AT&T for the DoD secure vocoder competition. Notable 
enhancements to the WI coder were made at the University of California, Santa Barbara. AT&T holds the core 
patents related to WI, and other institutes hold additional patents. Using these patents as a part of WI coder 
implementation requires licensing from all IPR holders. 

The product is the result of a co-operation between CML Microcircuits and SPIRIT DSP. The co-operation combines 
CML’s 39-year history of developing mixed-signal semiconductors for professional and leisure communication 
applications, with SPIRIT’s experience in embedded voice products. 

Voice effects in music 

For musical applications, a source of musical sounds is used as the carrier, instead of extracting the fundamental 
frequency. For instance, one could use the sound of a synthesizer as the input to the filter bank, a technique that 
became popular in the 1970s. 

Musical history 

One of the earliest people to recognize the potential of the vocoder and the Voder for electronic music may have been Werner 
Meyer-Eppler, a German physicist, experimental acoustician and phoneticist. In 1949, he published a thesis on 
electronic music and speech synthesis from the viewpoint of sound synthesis,[11] and in 1951 he joined the 
successful proposal to establish the WDR Cologne Studio for Electronic Music. 

One of the first attempts to adapt the vocoder for making music may have been the 
“Siemens Synthesizer” at the Siemens Studio for Electronic Music, 
developed between 1956 and 1959.[13][14] 

In 1968, Robert Moog developed one of the first solid-state musical 
vocoders for the electronic music studio of the University at Buffalo.[15] 

In 1969, Bruce Haack built a prototype vocoder, named "Farad" after 
Michael Faraday,[16] and it was featured on his rock album The 
Electric Lucifer released in the same year.[17][18] 

In 1970 Wendy Carlos and Robert Moog built another musical 
vocoder, a 10-band device inspired by the vocoder designs of Homer 
Dudley. It was originally called a spectrum encoder-decoder, and later 
referred to simply as a vocoder. The carrier signal came from a Moog modular synthesizer, and the modulator from a 
microphone input. The output of the 10-band vocoder was fairly intelligible, but relied on specially articulated 
speech. Later improved vocoders use a high-pass filter to let some sibilance through from the microphone; this ruins 
the device for its original speech-coding application, but it makes the "talking synthesizer" effect much more 
intelligible. 

Carlos and Moog's vocoder was featured in several recordings, including the soundtrack to Stanley Kubrick's A 
Clockwork Orange in which the vocoder sang the vocal part of Beethoven's "Ninth Symphony". Also featured in the 
soundtrack was a piece called "Timesteps," which featured the vocoder in two sections. "Timesteps" was originally 
intended as merely an introduction to vocoders for the "timid listener", but Kubrick chose to include the piece on the 
soundtrack, much to the surprise of Wendy Carlos. 

Kraftwerk's Autobahn (1974) was one of the first successful pop/rock albums to feature vocoder vocals. Another of 
the early songs to feature a vocoder was "The Raven" on the 1976 album Tales of Mystery and Imagination by 
progressive rock band The Alan Parsons Project; the vocoder also was used on later albums such as I Robot. 
Following Alan Parsons' example, vocoders began to appear in pop music in the late 1970s, for example, on disco 
recordings. Jeff Lynne of Electric Light Orchestra used the vocoder in several albums such as Time (featuring the 
Roland VP-330 Plus MkI). ELO songs such as "Mr. Blue Sky" and "Sweet Talkin' Woman", both from Out of the 
Blue (1977), use the vocoder extensively. Featured on the album are the EMS Vocoder 2000W MkI and the EMS 
Vocoder (-System) 2000 (W or B, MkI or II). 

Giorgio Moroder made extensive use of the vocoder on the 1975 album Einzelgänger and on the 1977 album From 
Here to Eternity. Another example is Pink Floyd's album Animals, where the band put the sound of a barking dog 
through the device. Vocoders are often used to create the sound of a robot talking, as in the Styx song "Mr. Roboto". 
A vocoder was also used for the introduction to the Main Street Electrical Parade at Disneyland. 

Vocoders have appeared on pop recordings from time to time ever since, most often simply as a special effect rather 
than a featured aspect of the work. However, many experimental electronic artists of the New Age music genre 
use the vocoder in a more comprehensive manner in specific works, such as Jean Michel Jarre (on Zoolook, 1984) 
and Mike Oldfield (on QE2, 1980 and Five Miles Out, 1982). There are also some artists who have made vocoders 
an essential part of their music, overall or during an extended phase. Examples include the German synthpop group 
Kraftwerk, Stevie Wonder ("Send One Your Love", "A Seed's a Star") and jazz/fusion keyboardist Herbie Hancock 
during his late 1970s period. 

In 1982 Neil Young used a Sennheiser Vocoder VSM201 on six of the nine tracks on Trans.[19] 

Voice effects 

"Robot voices" became a recurring element in popular music during the 20th century. Apart from vocoders, 
other methods of producing variations on this effect include the Sonovox, the talk box, Auto-Tune,[20] linear 
prediction vocoders, speech synthesis,[21][22] ring modulation and comb filtering. 

Vocoders are used in television production, filmmaking and games, usually for robots or talking computers. The 
Cylons from Battlestar Galactica used an EMS Vocoder 2000 to create their voice effects. The 1980 version of 
the Doctor Who theme has a section generated by a Roland SVC-350 Vocoder. 

Synthesizer voice 

In 1972, Isao Tomita's first electronic music album Electric Samurai: Switched on Rock was an early attempt at 
applying speech synthesis techniques to electronic rock and pop music. The album featured electronic renditions of 
contemporary rock and pop songs, with synthesized voices in place of human voices. In 1974, he used 
synthesized voices again in his popular classical music album Snowflakes are Dancing, which became a worldwide 
success and helped popularize electronic music.[23] 

Phase vocoder 


A phase vocoder is a type of vocoder which can scale both the frequency and time domains of audio signals by 
using phase information. The computer algorithm allows frequency-domain modifications to a digital sound file 
(typically time expansion/compression and pitch shifting). 

At the heart of the phase vocoder is the short-time Fourier transform (STFT), typically coded using fast Fourier 
transforms. The STFT converts a time domain representation of sound into a time-frequency representation (the 
"analysis" phase), allowing modifications to the amplitudes or phases of specific frequency components of the 
sound, before resynthesis of the frequency domain representation into the time domain by the inverse STFT. The 
time evolution of the resynthesized sound can be changed by means of modifying the time position of the STFT 
frames prior to the resynthesis operation allowing for time-scale modification of the original sound file. 

Phase coherence problem 

The main problem that has to be solved in every manipulation of the STFT is that individual signal 
components (sinusoids, impulses) are spread over multiple frames and multiple STFT frequency locations (bins). 
This is because the STFT analysis uses overlapping analysis windows. The windowing results in spectral 
leakage, so that the information of an individual sinusoidal component is spread over adjacent STFT bins. To avoid 
border effects caused by the tapering of the analysis windows, the STFT analysis windows overlap in time. Because of this 
overlap, adjacent STFT frames are strongly correlated (a sinusoid present in the analysis frame at time "t" will be 
present in the subsequent frames as well). The difficulty of signal transformation with the phase vocoder is 
that every modification made in the STFT representation needs to preserve the appropriate 
correlation between adjacent frequency bins (vertical coherence) and time frames (horizontal coherence). Except for 
extremely simple synthetic sounds, these correlations can be preserved only approximately, and since the 
invention of the phase vocoder research has mainly been concerned with finding algorithms that preserve the 
vertical and horizontal coherence of the STFT representation after modification. For time scaling operations, 
amplitude coherence is only a minor problem because shifting analysis frames in time has only a minor impact on 
the amplitude. The phase coherence problem, by contrast, was studied for a long time before appropriate solutions 
emerged. 
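
Both effects described in this section can be observed on a single sinusoid. The sketch below analyses two overlapping Hann-windowed frames of a sinusoid whose frequency falls between bins; it shows that several adjacent bins carry significant energy (leakage) and that the phase at the peak bin advances between frames by the sinusoid's frequency times the hop size (the relationship that horizontal coherence must respect). The frame length, hop size, threshold and names are arbitrary choices for illustration.

```python
import cmath
import math

N, HOP = 64, 16                       # frame length and hop size (assumed values)
CYCLES_PER_FRAME = 6.4                # deliberately between bins 6 and 7
omega = 2 * math.pi * CYCLES_PER_FRAME / N   # radians per sample

def windowed_dft(x, start):
    """DFT (bins 0..N/2) of one Hann-windowed frame starting at `start`."""
    frame = [x[start + t] * (0.5 - 0.5 * math.cos(2 * math.pi * t / N))
             for t in range(N)]
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / N) for t in range(N))
            for k in range(N // 2 + 1)]

x = [math.cos(omega * t) for t in range(N + HOP)]
frame0, frame1 = windowed_dft(x, 0), windowed_dft(x, HOP)

# Leakage: bins holding at least 10% of the peak magnitude.
peak = max(range(len(frame0)), key=lambda k: abs(frame0[k]))
leaky = [k for k in range(len(frame0)) if abs(frame0[k]) > 0.1 * abs(frame0[peak])]

# Phase advance of the peak bin between the two frames, wrapped to (-pi, pi].
advance = cmath.phase(frame1[peak] / frame0[peak])
expected = cmath.phase(cmath.exp(1j * omega * HOP))

print(peak, leaky)
print(round(advance, 3), round(expected, 3))
```

If a transformation changes the spacing of the frames without adjusting these phases accordingly, the overlapping frames no longer add up constructively, which is the source of the artifacts discussed above.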

History 

The phase vocoder was introduced in 1966 by Flanagan as an algorithm that would preserve horizontal coherence 
between the phases of bins that represent sinusoidal components.[1] This original phase vocoder did not take into 
account the vertical coherence between adjacent frequency bins, and therefore time stretching with this system 
produced sound signals that lacked clarity. 

The optimal reconstruction of the sound signal from the STFT after amplitude modifications was proposed by 
Griffin and Lim in 1984.[2] This algorithm does not address the problem of producing a coherent STFT, but it does 
find the sound signal whose STFT is as close as possible to the modified STFT, even if the modified 
STFT is not coherent (does not represent any signal). 

The problem of vertical coherence remained a major issue for the quality of time scaling operations until 1999, 
when Laroche and Dolson proposed a rather simple means to preserve phase consistency across spectral bins.[3] 
Their proposal has to be seen as a turning point in phase vocoder history. It has been shown 
that by ensuring vertical phase consistency, very high quality time scaling transformations can be obtained. 

The algorithm proposed by Laroche did not preserve horizontal phase coherence at sound onsets (note 
onsets). A solution to this problem was proposed by Roebel.[4] 

One software implementation of phase-vocoder-based signal transformation that uses techniques similar to those 
described above to achieve high-quality results is Ircam's SuperVP. 

Use in music 

British composer Trevor Wishart used phase vocoder analyses and transformations of a human voice as the basis for 
his composition VOX 5 (part of his larger VOX Cycle). Transfigured Wind by American composer Roger 
Reynolds uses the phase vocoder to perform time-stretching of flute sounds.[7] 

The proprietary Auto-Tune pitch-correcting software, widely used in commercial music production, is based on the 
phase vocoder principle. 

Audio timescale-pitch modification 


Time stretching is the process of changing the speed or duration of an audio signal without affecting its pitch. Pitch 
scaling or pitch shifting is the opposite: the process of changing the pitch without affecting the speed. There are also 
more advanced methods used to change speed, pitch, or both at once, as a function of time. 

These processes are used, for instance, to match the pitches and tempos of two pre-recorded clips for mixing when 
the clips cannot be reperformed or resampled. (A drum track containing no pitched instruments could be moderately 
resampled for tempo without adverse effects, but a pitched track could not). They are also used to create effects such 
as increasing the range of an instrument (like pitch shifting a guitar down an octave). 

Resampling 

The simplest way to change the duration or pitch of a digital audio clip is to resample it. This is a mathematical 
operation that effectively rebuilds a continuous waveform from its samples and then samples that waveform again at 
a different rate. When the new samples are played at the original sampling frequency, the audio clip sounds faster or 
slower. Unfortunately, the frequencies in the sample are always scaled at the same rate as the speed, transposing its 
perceived pitch up or down in the process. In other words, slowing down the recording lowers the pitch, speeding it 
up raises the pitch, and the two effects cannot be separated. This is analogous to speeding up or slowing down an 
analogue recording, like a phonograph record or tape, creating the Chipmunk effect. 
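
The coupling between speed and pitch can be seen in a few lines. The sketch below resamples a 200 Hz test tone by linear interpolation, which is a simplification of proper band-limited resampling, and estimates the resulting frequency by counting zero crossings. All names and the test tone are assumptions for illustration.

```python
import math

def resample_linear(x, speed):
    """Read x at `speed` input samples per output sample (linear interpolation)."""
    n_out = int((len(x) - 1) / speed) + 1
    out = []
    for i in range(n_out):
        pos = i * speed
        j = int(pos)
        frac = pos - j
        nxt = x[j + 1] if j + 1 < len(x) else x[j]
        out.append((1.0 - frac) * x[j] + frac * nxt)
    return out

def estimate_hz(x, fs):
    """Crude frequency estimate: two zero crossings per cycle."""
    crossings = sum(1 for a, b in zip(x, x[1:]) if a * b < 0)
    return crossings * fs / (2.0 * len(x))

FS = 8000
tone = [math.sin(2 * math.pi * 200 * t / FS + 0.3) for t in range(FS)]  # 1 s at 200 Hz
fast = resample_linear(tone, 2.0)       # played back at FS, this is twice as fast

print(len(tone) / FS, round(estimate_hz(tone, FS)))   # 1.0 s, about 200 Hz
print(len(fast) / FS, round(estimate_hz(fast, FS)))   # 0.5 s, about 400 Hz
```

Halving the duration doubles the frequency, an octave up, which is exactly the effect the methods in the following sections are designed to avoid.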

Phase vocoder 

One way of stretching the length of a signal without affecting the pitch is to build a phase vocoder after Flanagan, 
Golden, and Portnoff. 

Basic steps: 

1. compute the instantaneous frequency/amplitude relationship of the signal using the STFT, which is the discrete 
Fourier transform of a short, overlapping and smoothly windowed block of samples; 

2. apply some processing to the Fourier transform magnitudes and phases (like resampling the FFT blocks); and 

3. perform an inverse STFT by taking the inverse Fourier transform on each chunk and adding the resulting 
waveform chunks. 
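
The three steps above can be sketched as follows for time stretching. This is a minimal textbook-style version, assuming a Hann window, a naive DFT and per-bin phase propagation; it does not include the later refinements for vertical coherence or transients, and the names and parameter values are illustrative.

```python
import cmath
import math

def rdft(frame):
    """DFT of a real frame, bins 0..n/2."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n // 2 + 1)]

def irdft(spec, n):
    """Inverse of rdft for even n, returning n real samples."""
    out = []
    for t in range(n):
        acc = spec[0].real
        for k in range(1, n // 2):
            acc += 2.0 * (spec[k] * cmath.exp(2j * math.pi * k * t / n)).real
        acc += (spec[n // 2] * cmath.exp(1j * math.pi * t)).real
        out.append(acc / n)
    return out

def time_stretch(x, stretch, n=64, hop_a=8):
    hop_s = int(round(hop_a * stretch))            # synthesis hop
    window = [0.5 - 0.5 * math.cos(2 * math.pi * t / n) for t in range(n)]
    n_frames = (len(x) - n) // hop_a + 1
    out = [0.0] * ((n_frames - 1) * hop_s + n)
    norm = [0.0] * len(out)
    prev_phase, synth_phase = None, None
    for m in range(n_frames):
        # Step 1: windowed analysis frame -> magnitudes and phases.
        spec = rdft([x[m * hop_a + t] * window[t] for t in range(n)])
        phase = [cmath.phase(v) for v in spec]
        # Step 2: estimate each bin's frequency and advance its phase
        # by that frequency over the synthesis hop.
        if m == 0:
            synth_phase = phase[:]
        else:
            for k in range(len(spec)):
                omega = 2 * math.pi * k / n
                delta = phase[k] - prev_phase[k] - omega * hop_a
                delta -= 2 * math.pi * round(delta / (2 * math.pi))  # wrap
                synth_phase[k] += (omega + delta / hop_a) * hop_s
        prev_phase = phase
        new_spec = [abs(v) * cmath.exp(1j * p) for v, p in zip(spec, synth_phase)]
        # Step 3: inverse transform and overlap-add at the new frame position.
        y = irdft(new_spec, n)
        for t in range(n):
            out[m * hop_s + t] += y[t] * window[t]
            norm[m * hop_s + t] += window[t] ** 2
    return [o / g if g > 1e-3 else 0.0 for o, g in zip(out, norm)]

# Demo: stretch a sinusoid with a 10-sample period to about twice its length.
x = [math.sin(2 * math.pi * t / 10 + 0.3) for t in range(512)]
y = time_stretch(x, 2.0)
mid = y[200:800]
crossings = sum(1 for a, b in zip(mid, mid[1:]) if a * b < 0)
```

For this stationary sinusoid the output is longer but keeps the 10-sample period, so the pitch is unchanged. On transients the per-bin phase propagation used here produces the smearing described below.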

The phase vocoder handles sinusoid components well, but early implementations introduced considerable smearing 
on transient ("beat") waveforms at all non-integer compression/expansion rates, which renders the results phasey and 
diffuse. Recent improvements allow better quality results at all compression/expansion ratios but a residual smearing 
effect still remains. 

The phase vocoder technique can also be used to perform pitch shifting, chorusing, timbre manipulation, 
harmonizing, and other unusual modifications, all of which can be changed as a function of time. 

Time domain 

SOLA 

Rabiner and Schafer in 1978 put forth an alternate solution that works in the time domain: attempt to find the period 
(or equivalently the fundamental frequency) of a given section of the wave using some pitch detection algorithm 
(commonly the peak of the signal's autocorrelation, or sometimes cepstral processing), and crossfade one period into 
another. 

This is called time domain harmonic scaling or the synchronized overlap-add method (SOLA) and performs 
somewhat faster than the phase vocoder on slower machines but fails when the autocorrelation mis-estimates the 
period of a signal with complicated harmonics (such as orchestral pieces). 
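
A minimal sketch of the synchronized overlap-add idea is given below. Instead of estimating the period explicitly, it searches a small range of offsets for the position where each new segment best matches the output already written (normalized cross-correlation) and crossfades there. The frame length, hop and search range are assumed values, and the function name is hypothetical.

```python
import math

def sola_stretch(x, stretch, frame=200, hop_a=50, search=20):
    """Time-stretch x by overlap-adding segments at correlation-aligned offsets."""
    hop_s = int(round(hop_a * stretch))     # nominal spacing in the output
    out = list(x[:frame])
    pos, target = hop_a, hop_s
    while pos + frame <= len(x):
        seg = x[pos:pos + frame]
        best_start, best_score = None, -2.0
        for k in range(-search, search + 1):
            start = target + k
            overlap = min(len(out) - start, frame)
            if start < 0 or overlap <= 0:
                continue
            dot = sum(out[start + i] * seg[i] for i in range(overlap))
            e_out = sum(out[start + i] ** 2 for i in range(overlap))
            e_seg = sum(seg[i] ** 2 for i in range(overlap))
            score = dot / math.sqrt(e_out * e_seg) if e_out > 0 and e_seg > 0 else 0.0
            if score > best_score:
                best_start, best_score = start, score
        if best_start is None:
            break
        overlap = min(len(out) - best_start, frame)
        for i in range(overlap):            # linear crossfade over the overlap
            w = i / overlap
            out[best_start + i] = (1.0 - w) * out[best_start + i] + w * seg[i]
        out.extend(seg[overlap:])
        pos += hop_a
        target += hop_s
    return out

# Demo: a single-pitched test signal with a 23-sample period.
x = [math.sin(2 * math.pi * t / 23 + 0.3) for t in range(2000)]
y = sola_stretch(x, 2.0)
rate_in = sum(1 for a, b in zip(x, x[1:]) if a * b < 0) / len(x)
rate_out = sum(1 for a, b in zip(y, y[1:]) if a * b < 0) / len(y)
```

The output is roughly twice as long while the zero-crossing rate, and hence the pitch, stays the same. With polyphonic material no single offset aligns all components at once, which is where this approach breaks down.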

Adobe Audition (formerly Cool Edit Pro) seems to solve this by looking for the period closest to a center period that 
the user specifies, which should be an integer multiple of the tempo, and between 30 Hz and the lowest bass 
frequency. 

This is much more limited in scope than the phase vocoder based processing, but can be made much less processor 
intensive, for real-time applications. It provides the most coherent results for single-pitched sounds like voice or 
musically monophonic instrument recordings. 

High-end commercial audio processing packages either combine the two techniques (for example by separating the 
signal into sinusoid and transient waveforms), or use other techniques based on the wavelet transform, or artificial 
neural network processing, producing the highest-quality time stretching. 

Untangling phase and time 

Another way to shift pitch and stretch time is to separate phase and 
time in a monophonic sound such as the ones of melody 
instruments. By altering only the time control, you can stretch, 
shrink or reverse time, or generate loops as needed in sampling 
synthesizers. Time shrinkage can also be used for compression 
purposes. By altering only the phase control, you can shift the 
pitch or apply FM synthesis distortions to an existing sound. This 
can be used to play instruments as an alternative to wavetable 
synthesis. 

To control phase and time independently, we would need to know the displacement of the sound for every pair of 
phase and time positions. This corresponds to a function on a cylinder, as shown in the figure. However, a sound signal is a 
one-dimensional signal. The sound signal can be considered an observation of the full function along a helix on the cylinder; this 
is drawn as a black line in the figure. The full function on the cylinder can be approximated by interpolating between 
points on the helix with (approximately) the same phase. From this function a different sound signal can be derived. 
For example, the grey line in the figure shows the path of a sound that has the same time progression but a lower frequency 
than the original, or a sound that has the same frequency and a faster time progression, or something in between. In 
the end, the whole process can be implemented for discrete sound signals as interpolation between values with similar 
phase and similar time.[2] 

The described technique is used in the monophonic version of the software Melodyne.[3] 

Figure: Modelling a monophonic sound as observation along a helix of a function with a cylinder domain. 

Sinusoidal/Spectral Modeling 

Another alternative method for time stretching relies on a spectral model of the signal. In this method, peaks are 
identified in frames using the STFT of the signal, and sinusoidal "tracks" are created by connecting peaks in adjacent 
frames. The tracks are then re-synthesized at a new time scale. This method can yield good results on both 
polyphonic and percussive material, especially when the signal is separated into sub-bands. However, this method is 
more computationally demanding than other methods. 

Speed hearing & Speed Talking 

For the specific case of speech, time stretching can be performed using PSOLA. 

Time stretching can be used with audio books and recorded lectures. Slowing down may improve comprehension of 
foreign languages [4]. 

While one might expect speeding up to reduce comprehension, Herb Friedman says that "Experiments have shown 
that the brain works most efficiently if the information rate through the ears—via speech—is the "average" reading 
rate, which is about 200-300 wpm (words per minute), yet the average rate of speech is in the neighborhood of 
100-150 wpm." [5] 

Speeding up audio is seen as the equivalent of "speed reading" [6] [7]. 

Time stretching is often used to adjust radio commercials [8] and the audio of television advertisements [5] to fit 
exactly into the 30 or 60 seconds available. 

Pitch scaling 

These techniques can also be used to transpose an audio sample while holding speed or duration constant. This may 
be accomplished by time stretching and then resampling back to the original length. Alternatively, the frequency of 
the sinusoids in a sinusoidal model may be altered directly, and the signal reconstructed at the appropriate time scale. 
Transposing can be called frequency scaling or pitch shifting, depending on perspective. 

For example, one could move the pitch of every note up by a perfect fifth, keeping the tempo the same. One can 
view this transposition as "pitch shifting", "shifting" each note up 7 keys on a piano keyboard, or adding a fixed 
amount on the Mel scale, or adding a fixed amount in linear pitch space. One can view the same transposition as 
"frequency scaling", "scaling" (multiplying) the frequency of every note by 3/2. 

Musical transposition preserves the ratios of the harmonic frequencies that determine the sound's timbre, unlike the 
frequency shift performed by amplitude modulation, which adds a fixed frequency offset to the frequency of every 
note. (In theory one could perform a literal pitch scaling in which the musical pitch space location is scaled [a higher 
note would be shifted at a greater interval in linear pitch space than a lower note], but that is highly unusual, and not 
musical). 
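
The difference between scaling and shifting frequencies can be checked numerically. The sketch below takes a hypothetical note with harmonics at 220, 440 and 660 Hz, transposes it up seven equal-tempered semitones (a ratio of about 1.498, close to the 3/2 of a just perfect fifth), and compares that with adding a fixed 110 Hz offset. The example frequencies and names are assumptions for illustration.

```python
def semitones_to_ratio(semitones):
    """Frequency ratio of an interval in twelve-tone equal temperament."""
    return 2.0 ** (semitones / 12.0)

def ratios(freqs):
    """Each frequency relative to the lowest one, rounded for display."""
    return [round(f / freqs[0], 3) for f in freqs]

harmonics = [220.0, 440.0, 660.0]      # fundamental and two overtones
fifth = semitones_to_ratio(7)          # about 1.498

scaled = [f * fifth for f in harmonics]     # musical transposition
shifted = [f + 110.0 for f in harmonics]    # fixed frequency offset

print(round(fifth, 4))     # 1.4983
print(ratios(scaled))      # [1.0, 2.0, 3.0]
print(ratios(shifted))     # [1.0, 1.667, 2.333]
```

Scaling keeps the overtones at whole-number multiples of the fundamental, so the timbre is preserved, while the fixed offset makes them inharmonic.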

Time domain processing works much better here, as smearing is less noticeable, but scaling vocal samples distorts 
the formants into a sort of Alvin and the Chipmunks-like effect, which may be desirable or undesirable. A process 
that preserves the formants and character of a voice involves analyzing the signal with a channel vocoder or LPC 
vocoder plus any of several pitch detection algorithms and then resynthesizing it at a different fundamental 
frequency. 

A detailed description of older analog recording techniques for pitch shifting can be found within the Alvin and the 
Chipmunks entry. 